1 Introduction

Virtual reality (VR) represents a significant advancement in technological paradigms, offering immersive digital environments that redefine conventional user interactions. The potential applications of VR, especially when integrated with eye-tracking, are vast and diverse. This survey endeavors to provide a comprehensive examination of the intersection between VR and eye-tracking, establishing a foundational platform for our forthcoming research trajectory in psychology.

Within the VR context, users are exposed to an array of stimuli that closely simulate real-world experiences. The democratization of VR technology, reflected in the increasing affordability and ubiquity of VR headsets, has catalyzed its adoption across a myriad of sectors. Complementing this is the integration of eye-tracking technology, which captures users’ gaze patterns, providing a nuanced perspective into their cognitive and perceptual processes. While early iterations of this technology were constrained by the necessity for intricate equipment, recent innovations have facilitated its seamless combination with VR headsets, Augmented Reality (AR) interfaces, and conventional display systems.

The confluence of VR and eye-tracking has ushered in a plethora of applications, from entertainment to education and specialized research. Yet, this union is not without challenges. Considerations pertaining to individual optics, inherent ocular deviations, and issues of spatial accuracy underscore the necessity for continued research and refinement.

In this survey, we undertake a meticulous exploration of the VR landscape, delineating between commercial solutions and bespoke systems crafted for research-specific needs. Concurrently, we assess the broader ramifications of eye-tracking across an array of disciplines, emphasizing its quintessential role in procuring accurate attention metrics. Notably, while several VR systems deploy head-tracking methodologies to ascertain user orientation within virtual spaces, it is the granular accuracy of eye-tracking that proffers unparalleled insights into attentional dynamics.

Our primary objective is to traverse the existing literature, elucidating the nuances of attention tracking within VR. This spans a spectrum from pragmatic applications to the frontiers of hardware evolution. This synthesis seeks to disambiguate the complex relationship between VR and eye-tracking, elucidating present capabilities, prospective developments, and existing challenges. For a contextual comparison, Table 1 juxtaposes our insights with extant reviews, underscoring the unique contribution of our work. It is with this foundational knowledge that we anticipate our foray into the domain of psychology, harnessing the synergistic potential of VR and eye-tracking.

Table 1 Comparison between this survey and previous VR studies

This paper is organized as follows: first, the methodology used to find previous work is presented; then, the basis of eye functioning is summarized to understand the movements that are frequently monitored in eye-tracking and the metrics that are commonly used in studies relying on this type of hardware. Commercial and custom hardware are then analyzed by evaluating works that are either based solely on end-user applications or focused on building an eye-tracking system. Next, the eye-tracking procedure, together with the calibration, is detailed on the basis of works implementing their own system or pipeline, rather than delving into commercial software. Optimizations concerning VR rendering are then presented to show current and future techniques aimed at fusing the understanding of how the eye works with efficient rendering. An in-depth review of the applicability of eye-tracking is finally introduced, organized into several expertise fields, in order to showcase the large number of articles using this technology. This survey ends with a discussion that highlights the benefits and drawbacks of this technology, and the conclusions, where the maturity level and future trends are analyzed.

1.1 Eye-tracking benchmarks

VR encompasses a wide spectrum of research fields, among which the role of computer graphics in content generation and visualization stands out. Recently, some studies have shown the benefits of using eye tracking in VR in different areas such as interaction and attention tracking. Regarding interaction, Luro and Sundstedt (2019) compared gaze aiming in VR with traditional controllers in a point-and-shoot task. They studied different target trajectories and speeds, analyzing the collected data with the system usability scale (SUS) and cognitive load questionnaires (NASA TLX). Results indicated that gaze can be used as a replacement for aiming in VR without negatively affecting task performance and comfort. Participants showed less physical demand using gaze tracking, and the SUS reported similar results for both methods. Other studies, such as the one conducted by Joo and Jeong (2020), proposed a user interface (UI) based on eye tracking and demonstrated that it reduces the time spent on simple operations, thus avoiding dedicated controllers and their usage time. In addition, Clay et al (2019) demonstrated the usefulness of eye-tracking combined with VR, exploring methods and tools that can be used in experimentation.

Regarding the study of user attention, other investigations compare the use of head tracking with eye tracking in recording user attention to different areas of interest. For example, Llanes-Jurado et al (2021) made a comparison between eye tracking and head tracking by studying multiple areas in a virtual environment. They showed that there is a high similarity between horizontal head and eye gaze and suggested a new threshold for areas of interest in virtual environments in order to compare both technologies. In another study, they presented rule-based criteria to calibrate fixation identification focused on different features (Llanes-Jurado et al 2020). Gaze tracking was also compared against controller tracking to show that the former is better suited for aiming at fast-moving targets, with faster reaction times and less physical effort while following the target path (Luro and Sundstedt 2019). In addition, Blattgerste et al (2018) compared eye-tracking-based interaction in VR and AR with head-tracking, evaluating the benefits of eye-tracking. They showed that gaze-tracking outperforms head-tracking in many features such as speed, user preference, and task load.

2 Methodology

A wide variety of research articles was reviewed in this work. To this end, Scopus was used as the principal cross-library search tool. Given the purpose of this survey, the Scopus query was performed by seeking the following terms in the title with the AND operator: “eye”, “tracking”, “gaze”, “virtual”, “reality” and “vr”. With these words, the next four searches were performed:

  • TITLE (eye AND tracking AND virtual AND reality): 118 results.

  • TITLE (eye AND tracking AND vr): 35 results.

  • TITLE (gaze AND tracking AND vr): 6 results.

  • TITLE (gaze AND tracking AND virtual AND reality): 18 results.

The results from the Scopus search are depicted in Fig. 1, whereas the top ten journals publishing these documents are shown in Fig. 2. The bottom image in Fig. 2 presents the journals from which the works considered in this review come; those marked with a star have been published in the previous top ten journals, blue-colored ones belong to Computer Graphics journals, and the orange bars aggregate conference proceedings.

Fig. 1
figure 1

Number of investigations obtained from the above Scopus queries, from 1994 to 2022

Fig. 2
figure 2

On the top side, the distribution of research among the top-10 publishing journals in the VR field. The inner color refers to the publication mode. On the bottom image, the most popular journals in our bibliography (star: top-10 journal from the above image, blue: Computer Graphics journal, orange: conferences)

With eye-tracking technology rising and continuously evolving, the proposed searches were filtered to keep studies published since 2019 and to omit those using discontinued devices and outdated technology. With this filtering, a total of 134 articles were considered. Next, manual filtering was carried out to ensure that the retrieved articles made use of virtual reality and eye-tracking technologies. After removing unavailable or non-English documents, 112 studies were finally considered. From these, the bibliography was enlarged with works found in the related-work and experimentation sections of the documents already included. This was especially relevant for custom eye-tracking devices, datasets and eye-tracking methodologies that establish a comparison with previous work.

In the case of rendering techniques, the Scopus search was TITLE-ABS-KEY(foveated AND rendering), since most of the included studies do not focus on VR devices but rather evaluate the user’s experience with different foveated rendering approaches. This search returned 208 documents, from which 30 were finally selected according to their alignment with the topic of this survey, their publication date and relevance.

3 Visual perception

There are numerous studies that explain how human vision works by analyzing everything from the eye to the brain processes involved (De Valois and De Valois 1980; Livingstone and Hubel 1988; Wandell 1995; Palmer 1999; Li 2014). In this section, we are going to focus on eye-tracking, how it is performed and why it is useful.

First of all, we must differentiate between the terms “eye-tracking” and “gaze-tracking”, as they are often used interchangeably but denote distinct aspects of visual monitoring. Eye-tracking is a broader term that encompasses the study and measurement of eye movement. This includes the position of the eye and its movement within the socket, which can involve tracking rapid movements known as saccades, periods where the eye remains still (fixations), and the dilation and constriction of the pupil. Eye-tracking provides raw data on where the eyes are positioned and how they move. It is a comprehensive view of ocular activity without necessarily tying it to specific points of focus in the environment.

On the other hand, gaze-tracking is more specific and is concerned with determining where a person is looking in their environment or on a screen. It takes the data from eye-tracking and interprets it to provide a point or region of focus. Essentially, while eye-tracking gives you the mechanism of the eye’s movement, gaze-tracking tells you the outcome or result of that movement in terms of focus points. For instance, in a virtual reality setup, gaze-tracking would indicate what object or scene component the user is looking at.

To determine where an individual is looking, eye-tracking technology often employs a principle called corneal reflection. In this methodology, an infrared light source illuminates the pupil, generating a reflection on the cornea. An infrared camera captures this reflection and locates the center of the pupil, deducing the rotation of the eye and determining the direction of the gaze. The location of the fovea varies across individuals due to the geometrical peculiarities of the eye; this has to be taken into account for gaze tracking because the optical and visual axes are not aligned (see Fig. 3). For this reason, a calibration procedure is applied to optimize eye-tracking detection (Tobii 2022a, b). This eye-tracking calibration is covered in more detail in Sect. 5.1.

Fig. 3
figure 3

A diagram illustrating how the fovea, the area of the eye responsible for sharp central vision, can shift from its normal position on the optical axis, requiring the use of calibration to accurately measure eye movements and position in eye-tracking systems (Tobii 2022a, b)

3.1 Eye movements

From a research perspective, eye movements in VR offer invaluable insights into a user’s attention, cognitive state, and emotional response. This data is instrumental for a range of fields, from psychology to neurology (Just and Carpenter 1980; Rayner 1998; Jacob and Karn 2003; Leigh and Zee 2015). For VR developers, understanding gaze patterns can optimize content placement, ensuring that key elements capture users’ attention. Furthermore, in training scenarios, eye movements can evaluate a trainee’s observational skills and focus. Eye movements can be classified into three basic types (Duchowski 2017):

  • Fixations: Occur when the eye stops over an object or position to collect visual information. The duration of a fixation is variable; however, the longer the fixation, the more visual information is collected and processed.

  • Saccades: Rapid, ballistic movements of the eyes that abruptly change the point of fixation. Because saccadic movements occur at high speed, vision is impaired. This is why they are not as important in eye-tracking as fixations. However, they do reveal information about the direction of the user’s gaze and the order of fixations and visual attention.

  • Smooth pursuits: Much slower tracking movements of the eyes designed to keep a moving stimulus on the fovea.
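To illustrate how fixations and saccades are typically separated in practice, the following sketch implements a minimal velocity-threshold (I-VT) classifier over a stream of gaze samples; the 30°/s threshold, the sampling rate and the sample values are illustrative assumptions rather than values taken from any of the cited works.

```python
import numpy as np

def classify_ivt(timestamps, gaze_angles_deg, velocity_threshold=30.0):
    """Minimal velocity-threshold (I-VT) classifier.

    timestamps      : (N,) sample times in seconds
    gaze_angles_deg : (N, 2) horizontal/vertical gaze angles in degrees
    Returns one label ('fixation' or 'saccade') per consecutive sample pair.
    """
    dt = np.diff(timestamps)                                          # seconds between samples
    dang = np.linalg.norm(np.diff(gaze_angles_deg, axis=0), axis=1)   # angular change (deg)
    velocity = dang / dt                                              # degrees per second
    return ["saccade" if v > velocity_threshold else "fixation" for v in velocity]

# Example: 120 Hz samples, a small drift followed by a rapid jump (a saccade)
t = np.arange(0, 0.05, 1 / 120)
angles = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [5.0, 3.0], [5.1, 3.0], [5.1, 3.1]])
print(classify_ivt(t[: len(angles)], angles))
```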

In addition to these basic movements, other movements can be mentioned that may be of interest depending on the study being performed:

  • Microsaccades: Extremely small, jerk-like eye movements that occur involuntarily when a person attempts to fixate their gaze on a single point. Even when we try to keep our gaze steady, our eyes are constantly making these minute adjustments, typically at a rate of around one to two per second.

  • Vestibular Ocular Reflex (VOR): An essential reflex that stabilizes images on the retina during head movement by producing an eye movement in the direction opposite to the head movement. This reflex allows us to maintain a clear visual focus on an object even when our head is moving. Discrepancies between the user’s real-world VOR and the simulated visual environment can lead to feelings of discomfort or motion sickness.

  • Vergence: A type of eye movement where the two eyes move in opposite directions (Howard 2002). This mechanism is critical for maintaining binocular vision and depth perception. There are two types of vergence movement:

    1. Convergence: The eyes move toward each other when viewing a close object.

    2. Divergence: The eyes move away from each other when viewing a distant object.

Discrepancies between real and simulated vergence movements can lead to discomfort or motion sickness.

  • Nystagmus: A vision condition characterized by involuntary, repetitive eye movements. These movements often result in reduced or limited vision as the eyes uncontrollably oscillate in a quick, jerky, or pendular manner. Nystagmus can occur in a horizontal, vertical, or rotary pattern, and it often involves both eyes. The presence of nystagmus can significantly impact the precision and utility of eye-tracking data, as the erratic eye movements may skew gaze-tracking metrics or result in misinterpretations of gaze direction or intent.

  • Blink: A natural and involuntary action that serves several essential functions, such as protecting the eye from irritants and keeping the eye moist by spreading tears over its surface. In an average person, blinks occur approximately every 4–6 s, or about 15–20 times per minute.

  • Ambient eye movements: Generally associated with the early phase of visual perception and characterized by a series of quick eye movements (or saccades) and brief fixations. They are used to get a rapid overview or ‘gist’ of a scene and help to orient the viewer in their surroundings.

  • Focal eye movements: Come into play after the initial ambient phase and involve longer, more deliberate fixations. Focal eye movements are used for detailed examination of objects or features of interest within a scene.

3.2 Conditions of testing

When analyzing eye-tracking data, conditions of testing must be taken into account. This refers to the specific parameters, environment, and controls established for a test or experiment. Properly defined conditions are crucial to ensuring that the results of a test are valid, reliable, and can be replicated. Depending on the context (whether it is a scientific experiment, product testing, clinical trials, or others), these conditions can vary. In this regard, when eye-tracking systems are used, we can speak of two conditions:

  • Head-free: refers to setups or systems that allow the user’s head to move freely without constraining it to a fixed position.

  • Head-still: typically refers to a condition or requirement in which a participant or subject is asked to keep their head stationary or motionless.

In VR setups, the “head-free” condition is predominant. This approach is favored because it enhances immersion and naturalism, is aligned with the head-tracking capabilities of VR systems, and increases user comfort by reducing motion sickness. Additionally, it supports interactivity, allows full 360° experiences, and distributes movement effort between the eyes and neck, reducing strain. While specific studies might occasionally require limited head movements, such as Sipatchin et al (2020) or Sipatchin et al (2021), general VR applications prioritize a head-free experience for a more comprehensive and immersive user engagement.

This testing condition introduces a new dimension of tracking: head tracking. While eye and gaze tracking concentrate on the eyes and where they focus, head tracking is a broader technology that captures the position, orientation, and movements of a user’s head in real-time. This becomes especially relevant in environments where the whole orientation of the viewer’s perspective can change based on the movement of their head. Head tracking is crucial in determining how a user is physically orienting themselves within a space. In immersive environments like VR or AR, head tracking ensures that the visual display adapts to the user’s head movements, offering a 360-degree perspective. For instance, if a user looks up or turns their head to the side in a VR simulation, the visual scene will adjust accordingly, giving the sensation of “looking around” within the virtual space.

3.3 Metrics

The aim of measuring and analyzing eye movements is to study the user’s attention, how he/she distributes it and what determines this distribution, which can be called attention tracking. Attention tracking in VR is a more comprehensive measure, aiming to gauge the depth of a user’s cognitive engagement. It is not just about where users are looking, but how engrossed they are. By merging eye movement data with other metrics, such as the ones listed below, VR systems can deduce how captivating a particular scene or object is for the user (Duchowski 2017).

The following are some of the metrics used in the different studies analyzed in this paper:

  • Fixation count: Refers to the number of fixations carried out per area of interest (AOI).

  • Time to First Fixation (TTFF): A metric used in eye-tracking research to measure the time it takes from the onset of a stimulus to the moment the viewer’s gaze fixates on a particular point or area of interest for the first time.

  • Duration of First Fixation (DFF): Represents the length of the viewer’s initial gaze fixation upon spotting a particular stimulus or area of interest.

  • Transition order between fixations: Denotes the sequence or order in which a viewer’s gaze moves from one point of fixation to another.

  • Dwell-time: In gaze analysis, this term represents the duration spent focusing on a single object or position. Its computation relies on:

    1. Identification of a fixation.

    2. A temporal window determining a threshold for the fixation’s duration.

  • Total fixation duration: Refers to the total time during which fixation points fall within a certain AOI, including the duration of the first fixation.

  • Saccade count: Quantifies the number of rapid eye movements or shifts made during observation.

  • Saccadic velocity: Denotes the rate at which the eyes transition during a saccade.

  • Amplitude of saccades: Measures either the angular or linear distance traversed by the eyes during a saccadic movement.

  • Reaction time: The interval required for a saccade to commence following the display of a stimulus or cue.

  • Search time: As gauged using eye-tracking methodologies, this metric determines the duration needed for a participant to visually identify a target or specific area of interest.

  • Microsaccade count: Represents the tally of minute, involuntary eye movements observed during focused gaze.

  • Fixation/Saccadic ratio: This metric, commonly employed in eye-tracking research, examines the relationship between phases of ocular stability (fixations) and swift eye motions (saccades). The ratio elucidates how observers assimilate visual data, offering clues about their cognitive condition or the intricacies of their task.

  • Blink rate: Denotes the frequency of a participant’s blinking per minute. It can be used to measure fatigue (Stern et al 1994), engagement (Ranti et al 2020) or cognitive load (Biondi et al 2023), among others.

  • Blink duration: This metric quantifies the duration during which an individual’s eyes remain closed in a blinking episode. It serves analogous purposes to the blink rate.

  • Eye status: Distinguishes between phases when the eye is open and when it is closed.

  • Gaze direction: Reflects the eyes’ alignment or orientation concerning a specific focal point. It reveals an observer’s current visual attention or point of interest.

  • Gaze position: It is the specific point in space or on a surface where a person’s eyes are directed or focused. It represents the spatial location that a viewer is currently looking at.

  • Gaze velocity: Measures the speed at which one’s gaze, or point of focus, changes position.

  • Gaze acceleration: Denotes the rate at which the speed of one’s gaze changes. Just as gaze velocity measures how quickly eyes move from one point to another, gaze acceleration measures how fast this velocity changes, either increasing or decreasing.

  • Eye position: Illuminates the eyes’ orientation or position concerning a reference, be it the observer’s head, a display, or an external scene. It denotes the eyes’ spatial arrangement and alignment at a given juncture.

  • Pupil diameter: It can be used to determine pupil dilations/contractions, which can indicate strong emotional stimuli, acute attention and working memory load (Slovak et al. 2022; Duchowski et al 2020).

  • Pupil position: Indicates the exact spatial location of the pupil.

  • Ocular deviation: Pertains to the eyes’ misalignment, meaning that the two eyes do not point exactly in the same direction. It is commonly seen in conditions known as strabismus or squint.

  • Spatial accuracy: Within the realm of eye-tracking, this denotes how precisely the system can determine where a user is looking. It is usually defined as the difference or error between the position recorded by the eye tracker and the actual position of the user’s gaze in the real world.

  • Transition entropy: A metric in eye-tracking analysis to quantify the predictability or randomness of a person’s gaze transitions between different areas or points of interest. It can be used as a measure of visual scanning efficiency (Shiferaw et al 2019).

  • Spatial distribution: Illustrates the spread or arrangement of gaze points or fixations across a particular visual field or area of interest. In other words, it describes where the user tends to look in a scene or interface.

  • Distance between users’ gaze and a target: Measures the spatial gap between a user’s gaze and a specific target.

  • Visual field: Represents the full extent of the area that can be seen when the eye is directed forward, encompassing the central and peripheral vision.

  • Near Point of Convergence (NPC): The closest point in space to which both eyes can direct their gaze before one or both eyes begin to turn outward, losing binocular alignment. In other words, it is the point at which your eyes can no longer maintain a coordinated focus on a near object, and one eye “breaks” or deviates from the target.

  • Positive Fusional Vergence (PFV): Also known as “convergence reserves”, represents the ability of the eyes to turn further inward (converge) than is necessary for binocular single vision. It is a measure of the extra convergence capacity the visual system has beyond what’s currently being used for a given task.

  • Near/far dissociated phoria: Gauges the eyes’ resting orientation when they are not synchronized to focus. It essentially describes the tendency of one eye to drift either inward (esophoria) or outward (exophoria) when the other eye is covered or when binocular vision is otherwise disrupted.

  • Convergence Insufficiency Symptom Survey: A validated instrument used to quantify symptoms associated with Convergence Insufficiency (CI), a common binocular vision disorder characterized by the eyes’ inability to work together efficiently at near distances.

  • Eye-tracking delay: Also known as “latency”, is the time interval between the occurrence of an eye movement or gaze event and the system’s ability to detect, process, and potentially respond to it.

  • Conditions of testing: The head-free or head-still setup under which the data were collected (see Sect. 3.2).

Table 2 summarizes the most commonly used units for each one of these metrics.

Table 2 Eye-tracking metric units
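As a minimal illustration of how some of these metrics are derived from already-identified fixations, the sketch below computes the fixation count, dwell time and time to first fixation per AOI; the AOI labels and fixation tuples are hypothetical example values, not data from any cited study.

```python
from collections import defaultdict

# Each fixation: (start_time_s, duration_s, aoi_label) -- hypothetical values
fixations = [
    (0.20, 0.35, "product"),
    (0.70, 0.18, "price"),
    (1.10, 0.42, "product"),
    (1.80, 0.25, "background"),
]

fixation_count = defaultdict(int)       # fixations per AOI
dwell_time = defaultdict(float)         # accumulated fixation duration per AOI (s)
time_to_first_fixation = {}             # TTFF per AOI (s from stimulus onset)

for start, duration, aoi in fixations:
    fixation_count[aoi] += 1
    dwell_time[aoi] += duration
    time_to_first_fixation.setdefault(aoi, start)   # keep only the first occurrence

print(dict(fixation_count))             # {'product': 2, 'price': 1, 'background': 1}
print(dict(dwell_time))                 # {'product': 0.77, 'price': 0.18, 'background': 0.25}
print(time_to_first_fixation)           # {'product': 0.2, 'price': 0.7, 'background': 1.8}
```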

4 VR headsets

In this section, we provide an overview of eye-tracking systems within VR, exploring both custom-built and commercial platforms.

Custom systems emerge as intriguing alternatives in the eye-tracking domain, primarily for their potential cost-effectiveness. The research spotlight in the realm of custom systems shines on three main facets: (1) the construction of the hardware architecture, (2) the development of software aimed at providing the user’s gaze vector and (3) the assessment of deviation between ground truth and computed results.

On the commercial side, established devices offer ready-to-use solutions for a wide array of eye-tracking applications. But like any evolving technology, they present their own set of challenges and areas for improvement. Highlighted by studies such as (Llanes-Jurado et al. 2021; Borges et al. 2018), and the investigation conducted by Sipatchin et al. (2021) on the HTC Vive Pro, it is evident that there is a discernible gap between real-world performance and manufacturer claims. For instance, the HTC Vive Pro’s spatial accuracy was put to the test in an ophthalmological context using an online virtual perimetry testing application. Two distinct testing conditions were employed: head-still, to assess eye-tracking accuracy across a vast visual field, and head-free, to evaluate the effects of head movements on eye-tracking precision and potential data spillage. The results were illuminating, revealing that the spatial accuracy was not as pristine as manufacturer specifications indicated and that head movements introduced a drop in precision and an increase in data loss.

Accordingly, this section is structured to first present the most frequent commercial headsets, whereas the final subsection is devoted to custom systems that build their own architecture to track the user’s line of sight.

4.1 Commercial headsets

Eye-tracking in VR can be approached using notable commercial devices, as depicted in the exhaustive compilation of commercial VR headsets provided in Table 3. These can be integrated into the VR headset, e.g., HTC Vive Pro Eye, or used in isolation. Although the number of solutions combining both technologies has significantly increased, only a few of them fuse both features. However, the vast majority of current eye-tracking research is based on VR-integrated solutions: they account for 57.6% of the reviewed articles published in 2021 (Fig. 4), whereas research from 2019 was mainly dominated by custom and isolated eye-tracking devices such as SensoMotoric Instruments (SMI). In contrast, integrated solutions barely reached 6% of the reviewed manuscripts in 2019, whereas they accounted for 28.8% of the reviewed research from 2020.

Table 3 Commercial headsets that integrate eye-tracking
Fig. 4
figure 4

Most popular virtual reality headsets and eye-tracking devices used in 2021. The left image shows the VR headsets, while the right image shows the eye-tracking devices

Besides integrated devices, non-commercial solutions built from scratch are also frequently evaluated (these will be referred to as custom devices from now on). These solutions are harder to use, as they require planning and building the hardware architecture, as well as developing the software for eye-tracking and calibration. Most of them use commercial VR headsets as the underlying infrastructure for the eye-tracking system (Table 4); however, recent work has also built custom VR devices that even improve headset comfort (Altobelli 2019).

Table 4 List of VR headsets that can work as the underlying infrastructure for custom eye-tracking systems and do not integrate eye-tracking by default. Resolution is represented in pixels per eye (width × height), and FOV refers to the horizontal field of view

4.2 Custom systems

Custom systems are mainly based on infrared (IR) cameras, as they enable acquiring eye data in the absence of lighting, which occurs in cave-like environments such as VR headsets. Nevertheless, they are also approached using cheap visible-light cameras, such as those integrated into mobile phones, since these setups are focused on low cost. The acquired data are then frequently transformed using traditional image processing algorithms, mostly focused on pupil and iris detection. Image processing pipelines are mainly classified into feature-based and model-based methods, depending on whether they seek relevant points in the images (features) or shapes (models). The proposed systems are finally evaluated considering the difference between the estimated and ground-truth gaze vectors.

The base hardware for custom devices is mainly given by commercial VR headsets without eye-tracking integration, even reverse-engineered and 3D printed (Altobelli 2019), though a wide range of headsets can be found in the literature (Fig. 5). This includes HTC Vive (Dong et al. 2020; Chugh 2020) and BOE (Sun et al. 2021) head-mounted displays (HMD), inexpensive plastic cases for mobile phones (Drakopoulos et al. 2020, 2021), and other headsets designed specifically for case studies such as Magnetic Resonance Imaging (MRI) systems (Qian et al. 2021) (see Fig. 6). As previously mentioned, eye-tracking architectures are mainly composed of IR cameras and light sources to perform video-based oculography (VOG). In the absence of natural lighting, IR cameras allow capturing the participant’s eye with distinct grayscale values for pupil and iris (Sun et al. 2021). The purpose of IR light sources is to generate recognizable reflections on the eyeball (Purkinje reflection points), thus helping to estimate the line-of-sight direction (Dong et al. 2020). The number of IR light sources ranges from one (Sun et al. 2021; Qian et al. 2021; Dong et al. 2020; Katrychuk et al. 2019) to eight (Lu et al. 2020), forming a ring shape that can later be identified through shape-fitting. However, a large number of IR light sources is reported to negatively affect participants (Qian et al. 2021; Dong et al. 2020), since they increase the risk of causing an undesired red-eye effect, besides causing more specular reflections; these flaws therefore influence pupil tracking accuracy and robustness. This is of particular concern when the illumination is placed very close to the face. Rather than creating an eye-tracking architecture from scratch, Chugh (2020) opted for estimating the gaze vector using Pupil Labs hardware, which also comes with a software application to access the collected images. Yet, a key challenge with prior wireless headset solutions like Google Daydream is their high demand for processing power. To address this issue, Photosensor Oculography (PSOG) has been proposed; this technique measures the IR reflection using a limited number of IR detectors. Nevertheless, an important obstacle lies in sensor shift, which significantly deteriorates spatial accuracy (Katrychuk et al. 2019).

Fig. 5
figure 5

Comparison of frequent infrastructures for building a custom eye-tracking system. a A cross-section of a VR headset coupled with IR lights and a camera for each eye (Sun et al 2021), and b a VR headset for mobile phones, similar to the one employed by Drakopoulos et al (2021)

Fig. 6
figure 6

Eye-tracking architecture within an MRI scanner, as proposed by Qian et al (2021)

5 Eye-tracking

On the basis of the previously reviewed devices, this section further explains how the calibration and eye-tracking stages are conducted on them. The following works do not always operate over custom devices, but they implement processing layers over the data provided by an eye-tracking system, either custom or commercial.

This section is structured as follows: first, Sect. 5.1 details how the system must be calibrated, in a similar manner to commercial devices, to ensure accurate eye-tracking. Once configured, Sect. 5.2 explains the algorithms used to track eye movement. These algorithms range from traditional detection methods, based on image processing and the recognition of eye regions and reflections caused by external and controlled light sources, to Artificial Intelligence (AI) algorithms. The former present a high computational cost for real-time tracking and have thus led to AI-based methods that require prior training. Accordingly, the final subsection is dedicated to datasets for eye-tracking applications.

5.1 Calibration

To accurately estimate the gaze vector, these devices are initially calibrated per participant by displaying multiple uniformly distributed points, whose gaze vector is known a priori (Fig. 7). Most studies present calibration methods based on nine points (Qian et al. 2021; Chugh 2020; Lu et al. 2020; Li et al. 2019), although Qian et al. (2021) used fifteen points to select the most appropriate fitting function on a single participant. However, increasing the number of points beyond twelve has been reported not to yield any meaningful improvements (Drakopoulos et al. 2021). Regarding the gaze mapping model that correlates screen coordinates with pupil coordinates in the acquired image, previous research has extensively used linear (Drakopoulos et al. 2021; Dong et al. 2020) and second-order polynomials (Sun et al. 2021; Lu et al. 2020; Li et al. 2019), as higher-order polynomial models show little improvement. However, quadratic and cubic polynomial models have been successfully applied to case studies where participants have a static pose (Qian et al. 2021). Besides pupil coordinates, other values can also be integrated into these polynomial models to account for head motion (Qian et al. 2021). During this process, the coefficients of the mapping functions vary; first, they can be initialized to average expected human values. Then, they can be optimized through algorithms such as Least Squares to minimize the distance between the calibration targets and the calculated gaze locations (Chugh 2020; Lu et al. 2020). Some of the calibration measurements may be discarded if they are considered outliers through statistical analysis, thus avoiding inaccurately estimating the model coefficients (Chugh 2020).

Fig. 7
figure 7

Calibration procedure of Dong et al (2020), which follows a predefined path
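As a minimal sketch of the gaze mapping step described above, the snippet below fits a second-order polynomial from pupil image coordinates to screen coordinates with least squares, in the spirit of the nine-point calibrations cited in this subsection; the synthetic pupil samples and coordinate ranges are illustrative and do not reproduce any specific device or pipeline.

```python
import numpy as np

def poly2_features(px, py):
    """Second-order polynomial terms of the pupil coordinates."""
    return np.column_stack([np.ones_like(px), px, py, px * py, px**2, py**2])

def fit_gaze_mapping(pupil_xy, screen_xy):
    """Least-squares fit of one coefficient vector per screen axis."""
    A = poly2_features(pupil_xy[:, 0], pupil_xy[:, 1])
    coeffs, *_ = np.linalg.lstsq(A, screen_xy, rcond=None)
    return coeffs                                   # shape (6, 2)

def map_gaze(coeffs, pupil_xy):
    """Apply the fitted mapping to new pupil coordinates."""
    return poly2_features(pupil_xy[:, 0], pupil_xy[:, 1]) @ coeffs

# Nine calibration targets (normalized screen coords) and synthetic pupil positions
screen_targets = np.array([[x, y] for y in (0.1, 0.5, 0.9) for x in (0.1, 0.5, 0.9)])
pupil_samples = 0.6 * screen_targets + 0.02 * np.random.randn(9, 2) + 0.2

coeffs = fit_gaze_mapping(pupil_samples, screen_targets)
print(map_gaze(coeffs, pupil_samples[:3]))          # should approximate the first three targets
```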

5.2 Eye-tracking

Once calibrated, images acquired from either RGB or IR cameras, as well as numerical data from headset sensors, must be processed to estimate the origin of the gaze vector. The most frequent pipelines in computer vision for estimating the gaze vector are feature-based and model-based methods, besides AI ones. Feature-based procedures emphasize finding features in images, e.g., the iris. These are mainly guided by simple image transformations, such as changing the image intensity, thresholding and morphological operations. The output is typically a binary mask with the location of the target feature, if found. These methods are known to be more dependent on specific devices, as intensity-based pipelines are more sensitive to lighting changes. On the other hand, model-based methods are intended to look for specific shapes, such as circular objects. Unlike feature-based methods, the output of this variant is the geometry of a specific part of the eye. However, most of the methods in the latter category frequently perform pre-processing operations similar to those of the feature-based category. Therefore, both kinds of algorithms end up being affected by the recording conditions. In addition, model-based algorithms are more time-consuming due to the shape-fitting phase, although they are more robust and precise.

Alternatively, images and data collected from eye-tracking can be processed with AI algorithms that either extract relevant eye parts from the image or directly estimate 2D screen locations and 3D gaze vectors. Nevertheless, most of these operate as supervised algorithms and require previously annotated datasets that must be representative enough to operate over people from different demographic groups (sex, age, race, etc.) and with accessories such as glasses. Furthermore, AI-based techniques require further computational resources for training. In this regard, previous work has investigated the use of simpler networks to operate on lower-consumption devices such as the Raspberry Pi 3 (Katrychuk et al. 2019).

There are other kinds of algorithms that have not been included since the most recent works date to one decade ago. For instance, shape-based methods use deformable eye templates that must be fitted with an actual human eye. Others, such as the cross-ratio category described in Kar and Corcoran (2017), have been included in feature-based and model-based categories as they intend to find the projected light sources on the eye.

In summary, feature-based methods are known to be more dependent on recording conditions (observe Fig. 8), due to their computer vision processing, although they are also less time-consuming. Model-based works are slightly more complex and typically rely on intensity processing techniques as well. Finally, AI-based studies are not as time-consuming in real-time, but they require training datasets captured in conditions similar to those found in a case study. A common challenge for all of them is to make them robust to different demographic groups. More insight into these categories is provided in Table 5, where the accuracy of the reviewed works is reported.

Fig. 8
figure 8

Lens reflection observed by Drakopoulos et al. (2021), which complicates the image processing

Table 5 Classification of revised eye-tracking methods according to the type, the input data, the recording set-up and the reported results

5.2.1 Feature-based

A naive approach is based on finding the iris, from which the gaze vector can be cast. To this end, feature-based algorithms detect key features with the help of intensity values. Accordingly, the pupil is known to be the darkest element within the eye region (Drakopoulos et al. 2021). Captured images can be slightly enhanced for later detection through the suppression of reflections, defined as image regions with abrupt intensity peaks. To achieve this, bright regions are averaged with their neighbors. Due to their low contrast, a histogram equalization such as Contrast Limited Adaptive Histogram Equalization (CLAHE) can also be applied (Drakopoulos et al. 2020, 2021). Concerning the core of the procedure, Dong et al. (2020) proposed to enhance the image contrast, apply a Haar-cascade eye detector (implemented in OpenCV) and threshold the image to output a binary image. Then, connected components are calculated and wrapped in a convex hull. The center of the largest connected component is considered to be the target feature within the eye pupil. Finally, Drakopoulos et al. (2020, 2021) described a variant of the traditional Hough transform, accelerated in 2D and focused on circular features to detect the iris. The extracted circular shapes are then assigned a confidence metric using a linear transformation that combines two visual features and weights extracted from experimentation. Therefore, challenging conditions lead to lower confidence values.

Although feature-based methods are already more efficient, most of them crop the images in real-time to further reduce their size. The area to be cropped is determined either by the Haar-cascade filter or in the calibration process. Hence, a safe eye area can be represented through a rectangle-shaped Region of Interest (ROI) whose corners are given by the minimum and maximum coordinates of the detected iris center points (Drakopoulos et al. 2021).
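A minimal feature-based sketch in this spirit, assuming an IR eye image in which the pupil is the darkest region, could combine CLAHE contrast enhancement, intensity thresholding and selection of the largest connected component as the pupil; the threshold value, kernel size and file name are illustrative assumptions rather than parameters from the cited works.

```python
import cv2
import numpy as np

def detect_pupil_center(eye_gray):
    """Return the (x, y) centroid of the darkest blob, assumed to be the pupil."""
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(eye_gray)                          # boost local contrast

    # The pupil is the darkest structure: keep only low-intensity pixels
    _, binary = cv2.threshold(enhanced, 40, 255, cv2.THRESH_BINARY_INV)
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN,
                              np.ones((5, 5), np.uint8))      # remove small reflections

    # The largest connected component is taken as the pupil region
    num, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    if num < 2:
        return None                                           # nothing found
    largest = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])
    return tuple(centroids[largest])

eye = cv2.imread("eye_ir.png", cv2.IMREAD_GRAYSCALE)          # hypothetical input image
if eye is not None:
    print(detect_pupil_center(eye))
```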

5.2.2 Model-based

Another relevant procedure is based on finding ellipses representing the iris or pupil. A frequent step prior to such a fitting process is the binarization of the image to extract features with elliptical shapes. Sun et al. (2021) seek the image area with the lowest average grey value, which is binarized to perform ellipse fitting on the pupil. For a single IR light source per eye, Qian et al. (2021) propose to segment the pupil by applying adaptive intensity thresholding, dilation and erosion, edge detection and ellipse fitting. With multiple IR light sources, the main objective is to binarize their reflections and apply the ellipse fitting on such points (Lu et al. 2020) (Fig. 9). However, they enhance the fitted ellipse with image processing: the pupil center is detected using a combination of morphological operators (dilation, erosion), smoothing, a watershed starting from the darkest area (the pupil) and edge detection. Hence, the first ellipse is adjusted using the Least Squares method and the resulting contour.

Fig. 9
figure 9

Ellipse fitting process of Lu et al (2020). From left to right: initial image cropping, cropping after pupil detection, binarization, detection of light spots and ellipse fitting

Regarding the efficiency of model-based methods, Sun et al. (2021) did not crop a predefined area; instead, the image was partitioned into smaller windows and only the one containing the pupil was further processed. The eye corners, and thus their bounding box, can also be manually marked during calibration and tracked in the following frames using state-of-the-art methods such as the Discriminative Correlation Filter with Channel Spatial Reliability (DCF-CSR) (Qian et al. 2021).
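A model-based counterpart, sketched below under the same darkest-region assumption, binarizes the pupil, extracts its largest contour and fits an ellipse with OpenCV; it illustrates the general fitting step rather than re-implementing any of the cited pipelines, and the threshold is again an illustrative value.

```python
import cv2
import numpy as np

def fit_pupil_ellipse(eye_gray, threshold=40):
    """Fit an ellipse to the largest dark contour, assumed to be the pupil."""
    _, binary = cv2.threshold(eye_gray, threshold, 255, cv2.THRESH_BINARY_INV)
    binary = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, np.ones((5, 5), np.uint8))

    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    pupil = max(contours, key=cv2.contourArea)       # largest dark blob
    if len(pupil) < 5:                               # fitEllipse needs at least 5 points
        return None
    (cx, cy), (major, minor), angle = cv2.fitEllipse(pupil)
    return {"center": (cx, cy), "axes": (major, minor), "angle": angle}
```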

5.2.3 Machine learning and deep learning

The increasing use of AI has also favored the proliferation of gaze-tracking works using Deep Learning (DL) and Machine Learning (ML) over infrared and visible images. More specifically, Convolutional Neural Networks (CNN) are the most widespread networks applied to gaze tracking and segmentation of eye parts (Chugh 2020; Katrychuk et al. 2019; Fuhl et al. 2020; Kothari et al. 2022), achieving the top-most accuracy in Table 5. Yet, calibration is required unless working with publicly available datasets.

The most frequent procedure is to use head or eye images to estimate the gaze vector, especially for laptop applications without 3D interactions. Wong et al. (2019) proposed a ResNet network for inferring the gaze vector, using as input a face image with color normalization and the head pose, given by yaw, roll and pitch angles. In the case of driving, the gaze location cannot always be estimated as accurately; instead, previous works have defined relevant gaze zones to be estimated. Naqvi et al. (2018) created their own driving dataset with 17 different gaze zones and used a CNN with a significant number of convolutions to estimate such zones. On the other hand, Illahi et al. (2022) proposed to predict gaze targets in real-time by solely using the normalized X, Y screen coordinates as well as the gaze velocity, head rotational velocity, gaze acceleration and head rotational acceleration. These variables were the input of a Recurrent Neural Network (RNN). Katrychuk et al. (2019) compared a Multilayer Perceptron (MLP) against a shallow CNN, using several configurations ranging from low-power to high-power. These set-ups are intended to optimize real-time tracking in low-consumption systems such as the Raspberry Pi 3. The shallow CNN obtained better results with the high-power configuration (deviation of 0.55°), as expected; however, other configurations may be preferred for speeding up the training and real-time tracking.

Other works are intended to estimate the pupil location from imagery, supported by previous calibrations that help to transform the detected location into a gaze vector (Ou et al. 2021). Ou et al. (2021) used the YOLOv3 network to predict the pupil’s center using their own dataset of near-field visible images. Alternatively, it is possible to conduct semantic segmentation over images, rather than solely detecting the pupil location. Chaudhary et al. (2019) proposed a U-Net-like network, in contrast to SegNet, to perform semantic segmentation over the OpenEDS dataset. Similarly, Kothari et al. (2022) trained an encoder-decoder network, DenseElNet, using multiple publicly available datasets, concluding that the use of multiple datasets effectively helped to obtain better results. Chugh (2020) described a simple CNN with feature upsampling and downsampling to estimate the pupil location from infrared images of individual eyes, obtaining a mean distance of 1.2 pixels from the expected output. Lu et al. (2022) trained a simple CNN with 3 layers to de-refract eye images and shape the pupil’s ellipse with five parameters (center, axes and tilt angle). They achieved an error above 2 mm in the estimation of the 3D pupil.
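To give an idea of the general structure of such networks, the sketch below defines a small convolutional regressor that maps a grayscale eye image to normalized pupil-center coordinates; the layer sizes, input resolution and training step are purely illustrative and do not correspond to any of the cited architectures.

```python
import torch
import torch.nn as nn

class PupilCenterNet(nn.Module):
    """Tiny CNN mapping a 1x64x64 eye image to a normalized (x, y) pupil center."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 16x32x32
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32x16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64x8x8
        )
        self.regressor = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 8 * 8, 128), nn.ReLU(), nn.Linear(128, 2)
        )

    def forward(self, x):
        return self.regressor(self.features(x))

# One illustrative training step on random data standing in for annotated eye images
model = PupilCenterNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
images = torch.rand(8, 1, 64, 64)           # batch of synthetic eye images
targets = torch.rand(8, 2)                   # normalized pupil centers in [0, 1]
loss = nn.functional.mse_loss(model(images), targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(float(loss))
```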

A less frequent set-up is to use webcams. In this case, the expected output is typically the pixel at which the user is looking. Gudi et al. (2020) split their methodology into two steps: (1) estimate the gaze vector from input images and (2) transform gaze vectors into gaze locations on the screen. The first step was solved with a pre-trained VGG16 network, whereas the second step was solved in three manners: (1) estimating the coefficients for translating one space into another, (2) using ML and (3) a hybrid method, where ML is used to estimate the coefficients. The latter outperformed the other two, while still obtaining errors above 4 cm. de Lope and Graña (2022) checked several pre-trained networks over their own dataset, extracted from a laptop camera, and found that DenseNet had the highest accuracy (91.30% with a multi-user dataset).

Rather than requiring deep learning expertise from the general public, there exist frameworks able to optimize the network and its hyperparameters (Bublea and Căleanu 2020). With this approach, a shallow CNN was used to obtain an accuracy of 85% over the Columbia Gaze dataset.

5.3 Eye-tracking datasets

There is a plethora of research concerning the publication of eye-tracking datasets for the intensive training of ML-based solutions. The main concern of gaze prediction datasets is to cover the population with a wide range of physical features, based on different genders, ethnicities, eye colors, ages and accessories (e.g., glasses, makeup, etc.). Eye-tracking datasets can be classified according to the task that participants carry out during acquisition, mainly split into real-world and elicited tasks. Despite this, they do not present clear patterns regarding the dataset properties. Following the classification of Palmero et al. (2021), datasets are mainly divided according to their illumination, sampling frequency, image resolution, number of participants, number of image sequences, annotation and whether head movements were allowed or not, as summarized in Table 6. Note that a considerable number of previously reviewed works construct their own datasets; however, these are typically small in contrast to the ones presented below.

Table 6 Summary of publicly available eye-tracking datasets, regarding their illumination, sampling frequency, image resolution, number of participants, allowed head movement, number of images and provided annotations

In this way, Palmero et al. (2021) presented an outstanding dataset to assess both gaze prediction and sparse segmentation in the AR and VR fields. As part of the prediction challenge, two different datasets were published. The first dataset consists of sequences of images from 87 subjects who were asked to gaze at specific dot patterns, thus allowing the collection of saccade and fixation eye movements and the corresponding ground-truth vectors. The obtained images were curated by removing blinks, incorrect detections and subject distractions, and by randomly selecting frames. Furthermore, the dataset was augmented by horizontally flipping each image, thereby providing a training dataset for both left and right eyes. Hence, the annotation of each image is its corresponding 3D gaze vector within the headset coordinate system. Second, a dataset for eye region, iris and pupil segmentation was provided by means of manually annotated image masks. Similarly, it can be flipped to augment the dataset and train models appropriately for both eyes.
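A minimal sketch of this flip augmentation, assuming images stored as arrays and gaze vectors expressed in a headset coordinate system whose x axis is horizontal: mirroring the image also requires negating the horizontal component of the gaze label, otherwise the augmented pair would be inconsistent.

```python
import numpy as np

def flip_sample(eye_image, gaze_vector):
    """Mirror a left-eye sample so it can also be used for the right eye.

    eye_image   : (H, W) grayscale array
    gaze_vector : (3,) gaze direction; x is assumed to be the horizontal axis
    """
    flipped_image = eye_image[:, ::-1].copy()                    # horizontal mirror
    flipped_gaze = gaze_vector * np.array([-1.0, 1.0, 1.0])      # negate horizontal component
    return flipped_image, flipped_gaze
```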

As most of the publicly available datasets are based on pupil and iris detection, Chugh (2020) collected a dataset for the extraction of corneal reflections by combining their own dataset and an unlabelled dataset from NVIDIA (Kim et al. 2019). Corneal reflections were then marked manually, thereby generating one binary mask per light source and image. The resulting dataset is finally augmented by cropping images and applying image-based operators, such as Gaussian and motion blurring, contrast adjustment or the addition of synthetic reflections. However, this dataset may be constrained to devices with similar IR light configurations.

Besides real datasets obtained from participants, Kim et al. (2019) augment their dataset with synthetic images (Fig. 10b). To this end, a set of models was used along with virtual light sources and realistic rendering to acquire synthetic eye imagery, thus allowing high-resolution data to be obtained. Similarly to their real dataset, 4 IR light sources were simulated.

Fig. 10
figure 10

Eye-tracking imagery from some cited datasets. a Comparison of saccades and smooth pursuit from OpenEDS dataset (Palmero et al. 2021), whereas b shows both synthetic and real-world datasets, although only the second one presents varying lighting conditions (Kim et al. 2019)

Instead of applying frequent data gathering operations, such as point tracking for the collection of saccades, fixations and smooth pursuits (Palmero et al. 2021; McMurrough et al. 2012) (Fig. 10a), other studies propose real-world tasks, ranging from indoor navigation or visual search (Kothari et al. 2020) to car riding (Fuhl and Kasneci 2021). However, they require automatic labelling with high confidence (Kothari et al. 2020; Fuhl et al. 2019) that may be preceded by manual labelling (Kothari et al. 2020; Tonsen et al. 2016). Then, data must be curated to avoid erroneous training samples, either by discarding events with abnormal duration (Kothari et al. 2020) or error entries (Fuhl and Kasneci 2021) derived from subject distraction, blinks or incorrect data. Note that false positives are far more harmful than false negatives for training purposes, as the latter only reduce the available dataset. Furthermore, most of the experiments are not performed using VR headsets. Instead, most of them are solely based on eye-tracking glasses (Tonsen et al. 2016; Katrychuk et al. 2019; McMurrough et al. 2012; Kothari et al. 2020; Fuhl and Kasneci 2021), allowing better handling of the hardware and varying lighting conditions. Thus, avoiding constant lighting may be desirable for applying the proposed datasets to other environment configurations. Nevertheless, Kim et al. (2019) managed to alter lighting conditions within VR headsets, whereas other works present a constant light source (Chugh 2020; Palmero et al. 2021).

Regarding user driving, Ortega et al. (2022) published a large dataset that provides, among other features, the location at which the user is looking (pre-defined, from zero to nine) as well as the bounding boxes of face, head, eyes and other objects that are used as distractors while driving. The dataset also includes frames with users yawning, having microsleeps, texting or drinking.

Other recent publications, such as Lu et al. (2022), have published 3D datasets aimed at providing the 3D pupil parameters together with the 2D location and the gaze vector. To this end, the users’ heads were stabilized during data collection, and the pupil detection process was significantly improved to surpass previous work. The 3D detection was observed to have a mean error above 2 mm, whereas the gaze vector estimation had an error of 4.38°. Garbin et al. (2020) collected a dataset that splits the eye region into eyelid, pupil and iris, and even provides 3D point clouds of the corneal topography. The dataset is composed of 12,759 annotated images collected from 286 subjects.

A summary of the described studies for eye-tracking dataset generation is shown in Table 6. These are the latest and most relevant regarding their size, though other notable work precedes those included in the summary (Fuhl et al. 2015, 2016, 2019). NVGaze (Kim et al. 2019) is solely described according to the real dataset, though a synthetic dataset is also generated. This approach allowed them to generate more advanced region maps. Accordingly, they were able to produce image samples along with the following labels: 2D gaze vector, head position, eye-lid states, pupil size, 2D iris center and pupil center, as well as accurate region maps that separate skin, pupil, iris, sclera, and corneal reflections.

As a future research line, Emery et al. (2021) offer new possibilities since they provide the head and hand orientations as well as the rendered frames, with the 3D gaze vector being the ground truth. Bozkir et al. (2020) introduced a protocol for collecting eye-tracking data in VR remotely. Miller et al. (2021) crafted a post-processing framework combining mobile eye-tracking with motion capture, enabling the computation of a 3D gaze vector tied to object and body positioning in VR, considering metrics such as gaze direction. Finally, Demir and Ciftci (2021) investigated the importance of gaze-tracking to discern fake videos from real ones.

Due to the high number of eye-tracking datasets, Kothari et al. (2022) studied the use of several of them applied to the segmentation of pupil and iris with an encoder-decoder network. As a result, they concluded that the use of several datasets helped to better generalize. However, for specific configurations regarding lighting, allowed head movement or scenario configuration, datasets ought to be filtered.

6 Rendering

Rendering optimization is a recurrent topic in Computer Science research, especially for realistic image synthesis. The nature of VR also worsens the performance of the rendering pipeline, as the trivial approach requires rendering the scene twice, once for each eye. Furthermore, the trend in VR is to increase both refresh rate and image resolution to provide a better user experience. Several optimizations have been proposed to deal with VR shortcomings, although most of the reviewed works focus on using devices that already implement these techniques as their underlying rendering core. Besides improving the rendering, another objective of these optimizations is to avoid discomfort from cyber-sickness. Otherwise, factors such as high latency may lead users to lose the sense of presence due to discrepancies between visual, vestibular, and proprioceptive awareness. As noted by previous work (Bayramova et al. 2021; Valori et al. 2020), proprioception is the sense of position and movement in space.

Rendering techniques in VR must be interpreted in light of the development of eye-tracking, which presents a timeline of incremental steps that finally led to the recent integration of VR and eye-tracking, with the latter being supported via hardware and software. Figure 11 shows the milestones that have occurred since the 1800s, finishing with the commercialization of the VR devices previously mentioned in Sect. 4. Matthews et al. (2020) jointly reviewed the history of VR and eye-tracking, and thus this chapter only briefly summarizes some of the identified main events. Accordingly, the development of VR started with the concept of stereoscopy (Brewster 1856) and evolved in waves. In parallel, eye-tracking and gaze-tracking started rudimentarily in 1910 with a device affixed to the user’s eye (Huey 1968). The first non-intrusive manner of recording eye movement emerged in 1937, in which light beam reflections were recorded on a piece of film (Hartridge and Thomson 1948). Then, a disruptive work addressing optimizations in the eye-tracking procedure was published (Yarbus 1967). From here, eye-tracking was mainly investigated as an alternative form of human–computer interaction (HCI) (Bolt 1982), and evolved into non-intrusive solutions such as webcams and IR cameras. The two most recent milestones come with (1) the main VR industry leaders emerging in 2016, including Oculus, HTC, Sony and Valve, and (2) eye-tracking being integrated into VR headsets (Vive Pro and Vive Cosmos along with Tobii) and the first 8 K headsets being announced.

Fig. 11 Timeline from 1838 to nowadays showing the main milestones related to rendering advancements

During this development, optimizations of the rendering using eye-tracking were studied and laid the foundations of today’s foveated rendering, i.e., concentrating the computational resources at the point at which the user is looking. Following this approach, areas falling outside the focus can be rendered with a lower level of detail (LOD). These LOD variations were first handled with geometric simplifications (Zheng et al. 2018), followed by adaptive rendering with variable resolution across the image (Meng et al. 2020). Shading simplifications follow the same idea: more accurate and time-consuming techniques are applied over the focused area, whereas more efficient, less accurate methods are applied over peripheral areas (Xiao et al. 2018). Less effort has been devoted to spatiotemporal degradation, in which the refresh rate is varied across the image and even data stored in the cache can be reused multiple times in non-relevant areas (Franke et al. 2021). With this in mind, VR rendering optimizations are classified in the following into traditional and perception-based optimizations.

6.1 Traditional optimizations

The naive approach duplicates the draw calls to render the view of each eye, thereby introducing a large latency overhead for dense scenarios and intricate rendering pipelines, including ray tracing. This increased latency is even higher for lighting techniques that require additional draw calls per light, such as cascaded shadow mapping (White et al. 2021). Instead of applying multi-pass rendering for each view, Multi-View Rendering (MVR) has been the preferred solution for addressing this problem. The underlying concept of MVR is to avoid several draw calls, which are known to be time-consuming, and instead compute the outcome of several views within a single one, in contrast to Single-View Rendering (SVR). This problem has been addressed for decades, starting with the rendering of splats, which were later used as impostors for different viewpoints (Schaufler and Stürzlinger 1996). Storing previous shading calculations in a cache (Sitthi-amorn et al. 2008) and enforcing restrictions on the discrepancies between different viewpoints have also been investigated (Halle 1998). Other optimizations rely on determining the potentially visible set (PVS), mainly as a camera–pixel relation, though more recent work has studied the correlation between camera movements and pixels (Hladky et al. 2019) to speed up the rendering.

MVR was first supported by Nvidia hardware with the Pascal architecture for generating up to two simultaneous views, as depicted in Fig. 12. The same draw call is able to render two different views of the scene, each with a different camera orientation, which are stored in two different textures. It was later improved for GPUs with the Turing architecture, enabling the rendering of four different views for ultra-wide FOV. These features can be accessed through the extension named OVR_multiview (NVIDIA Corporation 2018; Unterguggenberger et al. 2020) in the OpenGL standard (Open Graphics Library), Vulkan, DirectX 11 and DirectX 12. However, MVR still requires multiple fragment-shader invocations despite using one pass. With this in mind, Unterguggenberger et al. (2020) explored the latency derived from MVR with a flexible framework that offers several possibilities regarding geometry instancing, framebuffers, as well as culling and clipping. Depending on the number of subdivisions, scene size and GPU architecture, the best pipeline was shown to vary among configurations.

Fig. 12 Multi-View Rendering of two different versions of the scene, with two target textures

6.2 Perception-based rendering

Rather than optimizing the rendering pipeline, the resulting image can be adapted according to the psychophysical properties of the human eye, whose field of view can be split into three regions, from lower to higher angular coverage: (1) the foveal region, (2) the inter-foveal region and (3) the periphery. The two latter regions present lower visual acuity, though they remain sensitive to motion and can thus be used to guide the user’s attention. This knowledge leads to the widespread foveated rendering (Fig. 13).

Fig. 13 Adaptive rendering of a scene according to eye-tracking, thereby allowing rendering to be optimized in the foveal region

In the initial stages of GPU development, only a fixed overall sampling frequency could be used, with fragment shaders being executed once per pixel; this remains the default behaviour unless indicated otherwise. Multi-sampling, on the other hand, enables fragment shaders to be run multiple times per pixel; however, the number of shader invocations per pixel remains constant over the entire image. The first solutions to overcome this were based on three draw calls, each one with lower quality (Guenter et al. 2012), as well as geometric techniques based on targeted mesh simplifications (Weier et al. 2014) and LOD (Mohanto et al. 2022). More recently, this limitation has been resolved with multi-rate shading, which allows varying the number of invocations per pixel and even issuing a single invocation for a group of pixels (NVIDIA Corporation 2020). This technique is supported in hardware through the NV_shading_rate_image extension for Turing GPUs. With this extension, the number of invocations per pixel can be controlled (SHADING_RATE_N_INVOCATIONS_PER_PIXEL_NV) or narrowed down to a single invocation for a block of surrounding pixels (SHADING_RATE_1_INVOCATION_PER_IxJ_PIXELS_NV), according to a predefined enumeration.
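
To make the relationship between gaze position and shading rate concrete, the following sketch (a conceptual illustration only; the tile size, per-pixel angular size and region thresholds are assumed values, and no GPU API is invoked) builds a coarse per-tile shading-rate map from the tracked gaze point, assigning full-rate shading to foveal tiles and progressively coarser rates toward the periphery:

```python
import numpy as np

def shading_rate_map(gaze_px, resolution=(1440, 1600), tile=16,
                     deg_per_px=0.02, fovea_deg=5.0, mid_deg=20.0):
    """Conceptual foveated shading-rate map with one entry per tile.

    gaze_px    : (x, y) gaze position in pixels, reported by the eye tracker
    deg_per_px : approximate angular size of one pixel (HMD dependent)
    Values mimic 'invocations per pixel': 1.0 means one fragment-shader
    call per pixel, 0.25 one call per 2x2 block, 0.0625 one per 4x4 block.
    """
    w, h = resolution
    tx, ty = np.meshgrid(np.arange(0, w, tile) + tile / 2,
                         np.arange(0, h, tile) + tile / 2)
    # Angular eccentricity of each tile centre with respect to the gaze point
    ecc_deg = np.hypot(tx - gaze_px[0], ty - gaze_px[1]) * deg_per_px
    return np.where(ecc_deg < fovea_deg, 1.0,
                    np.where(ecc_deg < mid_deg, 0.25, 0.0625))

# Example: gaze slightly left of centre in a 1440x1600 per-eye framebuffer
rates = shading_rate_map(gaze_px=(600, 800))
print(rates.shape, float(rates.min()), float(rates.max()))
```

On supporting hardware, a map of this kind corresponds conceptually to the shading-rate image consumed by the extension, so that the rasterizer issues fewer fragment-shader invocations in peripheral tiles.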

However, this variation based on the eye’s target also leads to aliasing and image disruptions that users can notice. To mitigate this, the adapted image has been post-processed, for instance by addressing contrast variation (Patney et al. 2016). Similarly, image contrast has been used to increase or reduce the number of samples depending on the contrast of an image area (Tursun et al. 2019). Even the dominance of one eye over the other can be exploited to reduce the rendering cost for the eye with lower visual acuity (Meng et al. 2020). Generative Adversarial Networks (GANs) have also been applied to generate plausible peripheral imagery from the foveal area (Kaplanyan et al. 2019). Tariq et al. (2022) added procedural noise at specific spatial frequencies whose absence users would otherwise notice in oversimplified areas. Besides isolated frames, the encoding and compression of foveated-rendering videos have also been studied (Illahi et al. 2020).

7 Applications

We have reviewed different fields of application for VR and eye-tracking, leading to the following classification: medicine and biology; neuroscience and marketing; engineering and architecture; and education and training. These categories highlight how different disciplines leverage the benefits of VR and eye-tracking in distinct ways. Figure 14 shows the number of papers found for each of these application fields.

Fig. 14 Number of studies devoted to each one of the considered application fields

7.1 Medicine and biology

The use of virtual reality in medicine has been increasing in recent years as the technology has improved, and eye-tracking is one of the additions that has brought the greatest value. It has enabled improvements in a variety of medical studies and opened new avenues in fields such as rehabilitation, the treatment of eye problems, vertigo, anxiety and Alzheimer’s disease, among others.

For rehabilitation, Fromm et al. (2019) introduced the potential of home rehabilitation through VR, enhanced by eye-tracking. Park et al. (2019) explored the feasibility of eye-tracking-assisted vestibular rehabilitation. Their experiments, which employed saccadic eye exercises, primarily measured spatial accuracy. They concluded that eye-tracking algorithms enhance vestibular rehabilitation using HMDs. Similarly, Lee et al. (2020) adopted vestibular rehabilitation exercises in VR, including Cawthorne–Cooksey and Herdman training, emphasizing the spatial accuracy measurements. Their findings suggest VR-based methods, coupled with eye-tracking, could offer safer and more engaging rehabilitation.

In the domain of ocular diseases, Tan et al. (2020) introduced an eye-tracking-aided VR system for pediatric amblyopia care, utilizing metrics like fixations count, dwell-time and transitions order. Their system dynamically adjusted the difficulty based on eye-tracking data, enhancing amblyopia treatment outcomes. Also, by knowing the gaze position, the eye tracker can assist the patient in achieving the goal by providing a hint. Yaramothu et al. (2019) assessed the VERSE video game for vision therapy, capturing clinical measures such as Near Point of Convergence (NPC), Positive Fusional Vergence (PFV), near/far dissociated phoria and the Convergence Insufficiency Symptom Survey (CISS). Results highlighted the game’s efficacy. Yeh et al. (2021) employed an eye-tracking VR system to measure ocular deviation in strabismus patients, finding a strong correlation with the traditional alternate prism cover test. Martínez-Almeida Nistal et al. (2021) analyzed glaucoma patients’ gaze patterns, emphasizing metrics like saccadic velocity, fixations count and fixation/saccadic ratio. Lastly, Mehringer et al. (2021) contrasted Hess Screen Test results in VR using eye-tracking with monitor-based methods, noting variations in measured visual deviation angles.

Beyond these domains, VR integrated with eye-tracking has eased Magnetic Resonance Imaging (MRI) procedures by minimizing patient anxiety, as noted by Qian et al. (2021). Al-Ghamdi et al. (2020) showcased eye-tracked VR as a potential analgesic during painful medical procedures. Davis (2021) evaluated VR and eye-tracking’s application for Alzheimer’s patients, with fixations being a key metric to monitor the disease’s progression. In biology, Gunther et al. (2020) introduced Bionic Tracking in VR using eye-tracking to trace biological cells, demonstrating its accuracy compared to traditional methods. Table 7 shows an overview of some of the applications mentioned, together with the most cited article in each of them. The most frequently observed metrics in these fields are shown in Figure 15.

Table 7 Brief summary of some of the most important studies carried out in the field of medicine and biology using eye tracking and VR
Fig. 15 Number of occurrences of different metrics in medicine and biology applications

7.2 Neurosciences and marketing

Virtual reality, enhanced by eye-tracking, is reshaping our understanding of the human mind in areas such as neuroscience and marketing. This combined approach offers deeper insights into human behavior, brain disorders, and even emotion recognition.

Within the realm of neglect disorders, Hougaard et al. (2021) demonstrated the value of VR and eye-tracking in assessing spatial neglect subtypes in stroke patients. Their experiment utilized metrics like dwell-time, fixations count and eye orientation, revealing significant differences in eye-tracking data between stroke and healthy patients. Similarly, Ogura et al. (2019) designed a VR application that quantitatively assessed the visual field in patients with unilateral spatial neglect using color changes in observed blocks. Additionally, Porras-Garcia et al. (2019) explored attentional bias toward body parts using VR and eye-tracking, emphasizing the differences in fixations between genders on weight-related and non-weight-related areas.

On human behavior, Reichenberger et al. (2020) analyzed how social anxiety influences attention in VR, especially concerning emotionally threatening stimuli, using dwell-time and fixations count. The study by Wang et al. (2019) on VR advertising highlighted that commercial objects are not the primary focal points, based on measures such as fixations count, duration of first fixation and total fixation duration. Pettersson (2021) conducted a thesis harnessing VR-based eye-tracking metrics, namely gaze direction, pupil position, and pupil diameter. These data were subsequently analyzed using a neural network to decipher user behavior. Melendrez-Ruiz et al. (2021) concluded that pulses were not effective eye-catchers in a VR supermarket setting after considering metrics like dwell-time and fixations count. In contrast, Tian et al. (2019) showcased the potential of VR and eye-tracking in fire escape behavior analysis, emphasizing the efficiency of the approach.

For prediction purposes, Stein (2021) is pioneering a method to predict locomotion paths in VR, linking various behavioral data including eye-tracking. He examined metrics like eye-tracking latency and distance between users’ gaze and a target for predicting future paths. Wechsler et al. (2019) indicated that gaze behavior, when analyzed using fixations count and dwell-time, could predict physiological stress responses. Furthermore, Huizeling et al. (2021) asserted that hesitation words during speech impact prediction capability based on eye-tracker fixations count.

Concerning emotion and recognition, Liu et al. (2020) discovered that stereoscopic images enhance facial recognition, as evidenced by measures such as fixations, dwell-time and pupil diameter. Geraets et al. (2021) compared facial emotion recognition across different media using VR and eye-tracking metrics like fixations count, dwell-time and total fixation duration and highlighted the potential of VR for emotion recognition training. Similarly, Tabbaa et al. (2021) compiled a dataset integrating eye-tracking data (gaze position, eyes status) and physiological measurements for VR emotion recognition, and Bozkir et al. (2019) presented an approach for recognizing driver cognitive load in VR and eye-tracking by collecting pupillary information, gaze position and performance measures (inputs on the accelerator, brake, and steering wheel) from a VR driving experiment to train multiple classifiers. Lim et al. (2021) undertook an initial study utilizing pupil position in VR eye-tracking to discern emotions. Their findings suggest the promising potential of pupil position as an emotion recognition metric. In another significant study, Hickson et al. (2019) introduced an algorithm that utilizes eye-tracking to infer facial expressions, even with partial face occlusion. Their trials with various convolutional neural networks yielded an impressive mean accuracy of 73%, surpassing the proficiency of advanced human raters.

Additional studies like Kobylinski and Pochwatko (2020) focused on movement detection in VR narration. Sterna et al. (2021) proposed an ideal design for psychophysiological and eye-tracking measurement in VR, emphasizing the fixations count. Mirault et al. (2020) analyzed transposed-word effects on reading using eye-tracking metrics in VR such as fixations count, dwell-time and gaze position. Maraj et al. (2021) investigated immersion and comfort using eye-tracked VR devices, concluding no significant difference in user responses. Jurik et al. (2019) explored the use of eye-tracking in VR, advocating for its potential in cross-cultural studies on human perception and cognition. Ryabinin et al. (2021) examined hierarchically segmented images, such as historical paintings, employing metrics like gaze position, fixations count, and saccades count. In the broader context, Meißner et al. (2019) discussed the role of VR and eye-tracking in marketing research, stressing the importance of metrics like spatial accuracy, gaze position, fixations count and saccades count, while Soret et al. (2020) explored how auditory and visual stimuli impact attention in VR by evaluating saccadic reaction time. Marwecki et al. (2019) introduced “Mise-Unseen” software in VR, leveraging eye-tracking to determine optimal scene change moments based on user attention, spatial memory, and metrics such as gaze position, spatial accuracy and pupil diameter. In the realm of VR storytelling, Yang et al. (2021) conducted a study to discern if eye-tracking could serve as an implicit interaction, exploring the boundaries between implicit and explicit interactions in VR eye-tracking. Lastly, Wang et al. (2020) evaluated which stimuli sources are the most effective to guide users in Cinematic Virtual Reality (CVR) by using eye-tracking data such as gaze position and reaction time.

Table 8 shows an overview of some of the applications mentioned, selected on the criterion of relevance (citations), whereas the popularity of different metrics in this field is depicted in Fig. 16.

Table 8 Summary of some important studies in neuroscience and marketing using eye tracking and VR
Fig. 16 Number of occurrences of different metrics in neuroscience and marketing applications

7.3 Engineering and architecture

In the fields of engineering and architecture, eye-tracking within VR has been applied to diverse areas, from risk assessment and situational awareness to architectural design and algorithm enhancement. Kang et al. (2020) delved into the correlation between visual paths and situational awareness, utilizing eye-tracking metrics like fixations and saccades in an oil rig anomaly detection scenario. Their findings revealed distinct visual paths between participants with varying levels of situational awareness. Khatri et al. (2020) focused on refining age classification during a virtual shopping task by optimizing the Dispersion Threshold Identification algorithm using the spatial distribution of eye-tracking data. Cubero (2020) predicted user choices from time-series gaze-position data using an LSTM, a type of RNN. Dong et al. (2020) introduced the “Central-Eye” eye-tracking method to discern human gaze focus, utilizing metrics like spatial accuracy and saccadic velocity. Additionally, Pettersson and Falkman (2020) classified human movement direction in a virtual environment using metrics like gaze direction, pupil position and pupil diameter to enhance collaborative robot intelligence, with a follow-up study by Pettersson and Falkman (2021) on predicting human movements using neural networks. In architectural design, Barsan-Pipu (2020) blended brain-computer interface (BCI), eye-tracking, VR, and AI-driven neurofeedback to discern designers’ conceptual design intentions, providing dynamic responses. Zhang et al. (2019) proposed a comprehensive method for cityscape design and protection, merging cognitive psychology, spatial behavior, and sociology, while emphasizing metrics like gaze position. Lastly, Özel (2019) developed a hazard recognition system for construction sites using VR and eye-tracking, investigating the impact of work experience and education on hazard recognition using metrics such as dwell-time and total fixation duration.
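
Since several of the studies above rely on fixation-derived metrics, and Khatri et al. (2020) explicitly tune the Dispersion Threshold Identification algorithm, a minimal I-DT sketch is included here for reference; the dispersion threshold and minimum duration are illustrative values that, in practice, are adapted to the device and task:

```python
import numpy as np

def idt_fixations(t, x, y, disp_thresh=1.0, min_dur=0.1):
    """Dispersion-Threshold Identification (I-DT) of fixations.

    t, x, y     : timestamps (s) and gaze angles (deg) as 1D NumPy arrays
    disp_thresh : maximum dispersion (x range plus y range), in degrees
    min_dur     : minimum fixation duration, in seconds
    Returns a list of (start_time, end_time, centroid_x, centroid_y).
    """
    fixations, i, n = [], 0, len(t)
    while i < n:
        j = i
        while j < n and t[j] - t[i] < min_dur:   # initial window of >= min_dur
            j += 1
        if j >= n:
            break
        disp = (x[i:j+1].max() - x[i:j+1].min()) + (y[i:j+1].max() - y[i:j+1].min())
        if disp <= disp_thresh:
            while j + 1 < n:                     # grow while dispersion stays low
                nx, ny = x[i:j+2], y[i:j+2]
                if (nx.max() - nx.min()) + (ny.max() - ny.min()) > disp_thresh:
                    break
                j += 1
            fixations.append((t[i], t[j], x[i:j+1].mean(), y[i:j+1].mean()))
            i = j + 1
        else:
            i += 1
    return fixations

# Synthetic example: 120 Hz samples with a gaze shift at t = 0.5 s
t = np.arange(0, 1, 1 / 120)
x = np.where(t < 0.5, 0.2, 8.0) + 0.05 * np.random.default_rng(1).standard_normal(t.size)
y = np.zeros_like(x)
print(len(idt_fixations(t, x, y)))               # expected: 2 fixations
```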

Table 9 shows an overview of some of the applications mentioned. The selection has been made based on the citations of the articles, choosing the most cited article for each application. On the other hand, Fig. 17 shows the number of occurrences of the previously revised metrics in the engineering and architecture fields.

Table 9 Some relevant studies in the area of engineering and architecture using eye tracking and VR
Fig. 17 Number of occurrences of different metrics in engineering and architecture applications

7.4 Education and training

Wang et al. (2021) employed VR and eye-tracking to explore how students with varying prior knowledge process visual behavior related to Japanese mimicry and onomatopoeia when learning Japanese as a second language. They utilized eye-tracking metrics like fixation count and dwell-time, expanding on the application of visual behavior analysis to real-time VR environments. Meanwhile, Khokhar et al. (2019) introduced an architectural framework for educational VR, aiming to make VR pedagogical agents responsive to user attention changes monitored via eye-tracking, leveraging metrics such as pupil diameter, gaze direction or the distance between the user’s gaze and a target. Additionally, Bacca-Acosta and Tejada (2021) delved into efficient eye-tracking data collection in 3D virtual environments and further investigated students’ visual behavior during English preposition learning in Bacca-Acosta et al. (2021) by analyzing fixations count and dwell-time from eye-tracking. Their findings indicate the efficacy of dynamic IVR environments with integrated scaffolds for enhanced learning performance, though they also highlighted challenges with the preposition "on". In addition, Gadin (2021) examined legibility in VR text displays through eye-tracking metrics, including fixation count, dwell-time and amplitude of saccades. Lastly, Komoriya et al. (2021) devised a system enabling handicapped individuals to compose text on computers through eye blinking and shifting. Following a feasibility analysis, notable improvements in desktop operations, success rates and execution times were observed.

In the field of training, the thesis by Laivuori (2021) employed VR eye-tracking in a simulation for training sea captains, using metrics like gaze position and pupil diameter. Burova et al. (2020) developed a training application emphasizing safety awareness, considering metrics like fixation count and gaze position. For sports, Mutasim et al. (2020) harnessed eye-tracking in VR to boost user performance, analyzing aspects like reaction time and search time.

Table 10 shows an overview of some of the applications mentioned, taking into account the most cited studies. As in the previous application fields, Fig. 18 depicts the number of occurrences of the different metrics.

Table 10 A selection of studies with eye tracking and VR applied in the field of education
Fig. 18 Number of occurrences of different metrics in education applications

8 Discussion

The latest major addition to virtual reality has been eye-tracking technology. Traditionally, studies using the user’s gaze direction were conducted on a computer screen where external eye trackers could be attached. However, with the addition of this technology to VR headsets, a new range of possibilities opens up to design more immersive applications where the user’s gaze can play an important role, not only for studies where the gaze provides relevant information but also for the inclusion of new interaction and optimization mechanisms. This integration not only allows us to explore more fields of applications for VR but also to improve the use of VR in existing fields: for example, it is possible to improve the rendering performance of VR applications with eye-tracking by concentrating computational resources at the gaze location. Among the opportunities offered by this technology, we can highlight better monitoring of user attention with the help of precise estimation of gaze vectors. Also, the registration of pupil dilations related to stimuli with a relevant meaning for the user helps to determine emotions without the need for external hardware such as heart rate or respiratory monitors (Finke et al. 2021).

However, there are still open problems as well as current and future works that must be pointed out. As reviewed in this survey, eye-tracking integration cannot yet be found in many virtual reality headsets (see Table 3), and those that include it are often expensive. In addition, the usefulness of these integrated eye-tracking systems depends on their configuration and accuracy, and, on top of this, the software layer is another key factor in getting the most out of this technology. A nuanced understanding requires an examination of the technical specifications of various devices. HMDs with integrated eye-tracking typically report an accuracy ranging between 0.5° and 1°. However, in certain real-world evaluations with devices like the FOVE0 and HTC VIVE Pro Eye, this accuracy has been observed to deviate, reaching up to 2° (Chernyak et al. 2021; Lamb et al. 2022), contingent on the experimental conditions. Comparatively, desktop eye-tracking devices, exemplified by the Tobii Pro Spectrum, boast a superior accuracy threshold of 0.3° (Tobii 2022a, b) under optimal conditions. Yet, it is worth noting that even this touted accuracy has been challenged in practical applications, with some studies suggesting a deviation nearing 1° (De Kloe et al. 2022). Additionally, eye trackers may face accuracy issues with users who wear eyeglasses due to the prism effect. This phenomenon arises especially when the center of the eyeglass lens does not align seamlessly with the center of the pupil, introducing potential error margins (Yeh et al. 2021). This juxtaposition between HMD-based and desktop eye-tracking systems elucidates a tangible challenge: the pressing need for further refinement of HMD systems to approach, if not surpass, the accuracy benchmarks set by their desktop counterparts.
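
To put these accuracy figures into perspective, an angular error translates into a positional error that grows with the distance to the observed content. A minimal sketch of this conversion, assuming for illustration a virtual target placed 2 m from the user, is shown below:

```python
import math

def gaze_error_metres(angular_error_deg, distance_m):
    """On-target positional error caused by a given angular gaze error."""
    return distance_m * math.tan(math.radians(angular_error_deg))

# Assumed example: a virtual object 2 m away from the viewer
for err_deg in (0.5, 1.0, 2.0):
    print(f"{err_deg}° -> {gaze_error_metres(err_deg, 2.0) * 100:.1f} cm")
```

Under this assumption, 0.5° corresponds to roughly 1.7 cm on the target, 1° to about 3.5 cm and 2° to about 7 cm, which helps explain why the accuracy gap between HMD-based and desktop systems matters for fine-grained attention analysis.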

Not all scenarios mandate the highest degrees of accuracy. For instance, in entertainment VR applications or basic interactive platforms, where the primary objective might be gauging user interest or attention, a moderate level of accuracy could be sufficient. Conversely, more specialized domains have stringent accuracy requirements. In medical training simulations or precision skill training modules, where the distinction of a few millimetres in gaze direction can make a significant difference, the demand for impeccable accuracy becomes paramount. Similarly, in research contexts where detailed gaze patterns are analyzed to derive cognitive or behavioral insights, the granularity and accuracy of eye-tracking are crucial. The aforementioned variations underscore the importance of tailoring eye-tracking systems and their accuracy thresholds to the specific objectives and requirements of individual VR applications.

A significant hindrance is the susceptibility of VR users to motion sickness, which can elevate drop-out rates in experiments, potentially skewing results (Clay et al. 2019). There is also the concern of position shifts in the VR headset after calibration, perhaps resulting from abrupt head movements or improper adjustments. Such shifts can considerably undermine accuracy. Addressing the aforementioned challenges necessitates a multi-pronged approach to enhance the robustness and reliability of eye-tracking within VR environments. One primary avenue for future exploration is the mitigation of motion sickness in participants. Advanced algorithms that predict and adjust for potential motion sickness triggers, based on individual user profiles, could be developed. Additionally, it might be beneficial to explore real-time re-calibration mechanisms (Plopski et al. 2016). These calibrations would serve to consistently maintain the accuracy of eye-tracking without disrupting the user experience. Furthermore, enhancing the ergonomic design of VR headsets to minimize position shifts will be pivotal. Collaborations with biomechanical engineers could lead to the development of headsets that balance comfort with secure placements, reducing the likelihood of inadvertent shifts during experiments.

Similarly, the requisite minimum frequency for eye-tracking, contingent on the type of eye movements being observed, inherently places hardware constraints. This stipulation might preclude the use of certain devices or necessitate specific configurations (Geraets et al. 2021). This limitation is not merely a question of selecting an appropriate device but embodies broader challenges in ensuring universal applicability and reproducibility of findings. One pivotal avenue for future investigation will be the development of adaptive tracking systems. These systems would intelligently modulate their tracking frequency based on the specific requirements of the ongoing task or experiment. Such adaptability could potentially obviate the need for high-frequency tracking during phases where it is unnecessary, conserving computational resources and power. Simultaneously, advancements in hardware miniaturization and processing capabilities are imperative. Collaborative efforts between eye-tracking research and hardware engineers might yield devices that, despite being compact, do not compromise on tracking frequency or accuracy. Innovations in semiconductor technologies and algorithmic optimizations can play a significant role in this direction. Additionally, exploring cloud-based processing solutions, where the eye-tracking data is processed remotely rather than on the device itself, might alleviate some hardware constraints (Zou et al. 2021). From a design perspective, the incorporation of an eye-tracking mechanism can alter the ergonomic properties of VR headsets. The additional weight or altered balance might impede prolonged usage, posing a challenge, particularly for extended experimental sessions or immersive experiences. Research into materials and design paradigms that allow for lightweight yet robust eye-tracking integration will be paramount. This would ensure that users can engage in prolonged VR sessions without discomfort, ensuring the integrity of extended experiments or applications.

As an alternative to current HMDs with integrated eye-tracking, a considerable number of studies have investigated custom eye-tracking devices built on existing HMDs as the underlying structure. However, the optimal number of IR light sources per eye is not clear, varying from one to eight in the literature, although using a larger number is known to have negative effects on users (Qian et al. 2021). In addition, calibration must be performed from scratch by showing patterns and locations on the screen and calculating the parameters of a model, whose complexity is frequently reduced to second-order polynomial expressions. Custom devices are essentially the baseline over which alternative eye-tracking methods are evaluated. These methods range from feature-based approaches, intended to find features such as the iris, to model-based approaches, focused on the detection of specific shapes, such as circles at the iris and pupil locations, as well as ML and DL models. The main shortcoming of all these methods is that they are highly dependent on the tested recording conditions; therefore, reflections and accessories such as glasses hinder the recognition of eye parts. Note that feature-based and model-based methods frequently use image-processing techniques, including thresholding and morphological operators, to process the recorded images. Hence, minor deviations from the expected intensity distribution have notable effects on the outcomes. Despite this, the revised articles reported accuracies even below 1°; however, these are typically measured against custom datasets and should be further checked to ensure that these solutions are more effective than current commercial ones.
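
As an illustration of the calibration model mentioned above, the second-order polynomial mapping from pupil-centre coordinates to screen coordinates can be fitted with ordinary least squares over the samples recorded while the user fixates the calibration targets. The sketch below uses illustrative variable names and synthetic data:

```python
import numpy as np

def poly2_design(pupil_xy):
    """Monomial basis [1, x, y, x*y, x^2, y^2] for each pupil sample."""
    x, y = pupil_xy[:, 0], pupil_xy[:, 1]
    return np.column_stack([np.ones_like(x), x, y, x * y, x ** 2, y ** 2])

def fit_poly2_calibration(pupil_xy, target_xy):
    """Least-squares fit of x_s = f(x_p, y_p) and y_s = g(x_p, y_p)."""
    coeffs, *_ = np.linalg.lstsq(poly2_design(pupil_xy), target_xy, rcond=None)
    return coeffs                                      # shape (6, 2)

def apply_calibration(coeffs, pupil_xy):
    return poly2_design(pupil_xy) @ coeffs

# Synthetic 9-point calibration grid with a known quadratic ground truth
rng = np.random.default_rng(0)
pupil = rng.uniform(-1, 1, size=(9, 2))                # normalized pupil centres
screen = 0.5 * pupil + 0.1 * pupil ** 2                # stand-in true mapping
coeffs = fit_poly2_calibration(pupil, screen)
print(np.abs(apply_calibration(coeffs, pupil) - screen).max())  # ~0
```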

On the other hand, machine learning techniques are starting to gain interest as they quickly determine the gaze point with low latency. CNNs are by far the most studied networks, either to perform semantic segmentation and output labelled eye parts or to calculate the normalized gaze location/vector. The majority of works use either shallow CNNs, which are expected to have a low memory footprint, or pre-trained networks, which are already known to perform well in classification tasks, despite having a larger memory footprint. On the downside, these methods require large datasets for learning relevant features from collected eye-tracking data, either images or numerical data (e.g., head orientation). Rather than sharing a similar data scheme, widespread datasets have different features, or similar ones captured under different lighting conditions, thus complicating the simultaneous use of various datasets. For instance, two of the previously revised driving datasets do not provide the gaze vector but a gaze zone; Ortega et al. (2022) labelled ten zones, whereas Naqvi et al. (2018) distinguished seventeen. Still, Kothari et al. (2022) proved that feeding networks with multiple datasets contributed to obtaining better results. Another barely explored area is the synthetic generation of datasets, which, together with realistic shading, may help to construct huge datasets including users from every possible demographic group, without the time-consuming tasks of gathering participants, labelling and cleaning data. Furthermore, few works have exploited numerical data together with images in machine learning (Wong et al. 2019). In this regard, the most recent datasets are providing further data, such as head and hand pose (Emery et al. 2021). Thus, the main limitations concerning machine learning on eye-tracking are the shortage of datasets and the disparate range of features published along with them.
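
As a rough sketch of the kind of shallow CNN referred to above (the layer sizes, input resolution and regression target are illustrative assumptions, written here with PyTorch), a few convolutional blocks over a grey-scale IR eye crop can feed a small regressor that outputs a normalized 2D gaze location or, alternatively, a 3D gaze vector:

```python
import torch
import torch.nn as nn

class ShallowGazeNet(nn.Module):
    """Small CNN mapping a 1x96x160 IR eye image to a 2D gaze point."""
    def __init__(self, out_dim: int = 2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),            # global average pooling
        )
        self.regressor = nn.Sequential(
            nn.Flatten(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, out_dim)
        )

    def forward(self, x):
        return self.regressor(self.features(x))

model = ShallowGazeNet()
dummy = torch.randn(8, 1, 96, 160)              # a batch of IR eye crops
print(model(dummy).shape)                       # torch.Size([8, 2])
```

Such a network keeps the parameter count in the tens of thousands, in line with the low memory footprint expected from shallow models, and can be trained with a plain regression loss against calibrated gaze labels.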

Further complicating the scenario is the potential data overload. The granular capture of eye movements within an intricate VR environment can generate vast datasets. The resultant data not only poses storage challenges but also demands robust algorithms for efficient processing and meaningful analysis. Advanced data compression algorithms, real-time analysis, and cloud-based solutions can help manage the vast amount of data generated. Machine learning models can further aid in filtering out noise and focusing on significant patterns within the data.

Although machine learning techniques are expected to be more robust by learning from huge datasets, they have not yet clearly surpassed feature- and model-based methods, or at least should be checked against similar data. Besides this, the choice of one over the others is subject to the system requirements. While most of the current research is devoted to desktop applications, there exist other use cases which may not fit in this field. For instance, Qian et al. (2021) integrated a custom eye-tracking system in an MRI scanner, whereas Katrychuk et al. (2019) evaluated DL networks with different capacities to be integrated into a Raspberry Pi 3. In conclusion, no single technique is the most efficient for every possible scenario; computer vision pipelines are typically fast enough due to the low dimensionality of IR images, especially in feature-based methods. Machine learning, on the other hand, offers low latency while having higher memory requirements, although this depends on the number of trained weights. Nevertheless, it is possible to develop shallow CNNs with few parameters that are still able to recognize high-level features.

Another key factor in eye-tracking is rendering. Optimizations concerning VR have achieved a notable level of maturity with MVR, and the horizon points toward supporting a higher number of simultaneous renderings as GPUs increase their capacity. However, optimizations focused on eye-tracking still need to be polished. These techniques are referred to as foveated rendering and, despite having their origin decades ago, various bottlenecks remain. Among these, artefacts that prevent seamless VR experiences, including aliasing, flickering (motion aliasing) and other temporal artefacts, are the most frequent. These artefacts arise from the rendering differences between peripheral and non-peripheral areas, regardless of the degradation technique employed. Furthermore, the peripheral areas are especially sensitive to contrast changes, even more than to stereoscopic depth. Studies combating this problem, for instance by blurring the limits between distinct areas, have not yet completely solved it, since they also modify the image contrast (Mohanto et al. 2022). As occurred in eye-tracking, rendering can also be helped by machine learning, although it is still at an early stage, which leaves these solutions not yet robust enough for commercialization. Nevertheless, previous work has already achieved the completion of images using GANs (Kaplanyan et al. 2019). Matthews et al. (2020) have also suggested that multi-rate shading may be implemented with machine learning.

While the aforementioned challenges provide a candid understanding of the current landscape, they simultaneously shed light on the immense potential awaiting realization. As technology and research methodologies evolve, the integration of eye-tracking in VR is poised to open up a plethora of transformative applications that transcend existing paradigms. Delving into the imminent horizon, we discern several promising trends and applications that harness the combined prowess of eye-tracking and VR, sculpting the future trajectory of immersive experiences.

The ever-evolving landscape of VR, augmented by eye-tracking capabilities, is poised to reshape several domains of human–computer interaction. A relatively untapped avenue is the operability of these systems with multiple users on non-single-view displays. This future line is especially relevant as displays tend to grow in size, together with light field displays that enable watching a scenario from different perspectives (Spjut et al. 2020). Hence, narrowing down the number of perspectives to be rendered and discarding those not directed toward any viewer may help in reducing computations. Building upon this, the adaptation of user interfaces utilizing gaze patterns offers tantalizing prospects. Predictive algorithms can leverage eye-tracking data to present a real-time, user-centric interface, thereby enhancing usability and reducing cognitive load (Plopski et al. 2022). This principle is extended further in interactive gaming. As gaming continues to be at the forefront of VR innovations, integrating eye-tracking could redefine gameplay mechanics, making them more immersive and challenging (Heilemann et al. 2022; Gemicioglu et al. 2023). Simultaneously, as the virtual domain becomes more intricate, the nuances of human behavior become even more crucial. In this context, the integration of realistic ocular movements within virtual entities enhances the verisimilitude of social interaction in VR. The subtleties of eye movements, intrinsic to authentic human communication, are paramount. Through the accurate replication of these nuances, virtual entities can attain a higher degree of anthropomorphic realism, thus facilitating genuine human-avatar interactions (Visconti et al. 2023).

Moreover, in the academic and professional worlds, this confluence of VR and eye-tracking is proving invaluable. For the research community, particularly those in cognitive sciences, the amalgamation of VR and eye-tracking presents a robust tool for experimental paradigms. By monitoring ocular movements within controlled VR scenarios, it becomes feasible to derive insights into intricate cognitive and behavioral processes (McNamara and Mehta 2020). Our further studies will be focused on this line of research. Furthermore, eye-tracking technology holds significant promise for professional skill augmentation, particularly in fields where precision and focus are paramount. Real-time feedback mechanisms, derived from eye-tracking in VR modules, can be employed in specialized training scenarios, such as surgical simulations or athletic drills, thereby facilitating accelerated and refined skill acquisition (Cowan et al. 2021; Stoeve et al. 2022; Galuret et al. 2023; Pastel et al. 2023). The data-rich domain that this confluence promises can also be a game-changer for machine learning. The voluminous data from eye-tracking can refine predictive models, aiding in understanding nuanced user behaviors or even identifying potential health concerns. This has therapeutic implications as well. By analyzing ocular responses to specific virtual stimuli, therapeutic strategies can be optimized for conditions such as post-traumatic stress disorder (PTSD) or specific phobias (Diemer et al. 2023; Fehlmann et al. 2023). In addition, the vast data generated from eye-tracking in VR stands to redefine content recommendation algorithms. Through gaze-based data, platforms could offer hyper-personalized content suggestions, further enhancing user experience (Pfeiffer et al. 2020). This gaze data also has significant implications in marketing. Analogous to the analysis of click-through rates on contemporary digital platforms, the scrutiny of gaze durations on specific virtual entities or promotional content can be leveraged to fine-tune marketing paradigms, tailoring them to individual user inclinations (Burke and Leykin 2014).

Lastly, the convergence of VR and eye-tracking presents novel opportunities in the realm of digital security. Unique biometric signatures derived from eye movements and retinal patterns could serve as robust authentication mechanisms within virtual environments (Lohr and Komogortsev 2022).

In essence, while challenges persist, the future teems with opportunities, promising a synergy between virtual reality and eye-tracking that could reshape numerous facets of our digital interactions.

9 Conclusions

Although other surveys have discussed some features of eye-tracking systems in different areas such as performance, usability, or trends, this study performs a comprehensive analysis of eye-tracking technology embedded in HMDs in terms of use, integration, and implementation. Besides commercial devices, the infrastructure and implementation of custom and inexpensive devices were also reviewed, both to tackle the scarcity of the former and to provide a more tailored solution for specific needs and requirements. In addition, this technology has been widely adopted by a large number of applications in recent years, including research reviewed in this survey in fields such as psychology, marketing, and human-computer interaction, as well as practical applications in areas such as assistive technology and user experience design.