1 Introduction

Convolutional Neural Networks (CNNs) are a family of Deep Learning (DL) models, a branch of Machine Learning (ML), commonly used in image and video analysis. They excel in detection, recognition, reconstruction, and object tracking, making them valuable for virtual reality (VR), augmented reality (AR), mixed reality (MR), and extended reality (XR) applications (Alzubaidi et al. 2021). The COVID-19 pandemic (Sarfraz et al. 2021; Lee and Trimi 2021) and Facebook’s shift towards Meta (Kraus et al. 2022) have accelerated the advancement and popularity of these fields.

1.1 Contribution

In recent years, the use of CNNs in the field of computer vision has increased due to their ease in processing images or video and recognising or classifying their content (Alzubaidi et al. 2021; Bhatt et al. 2021). At the same time, Virtual, Augmented, Mixed, and Extended Realities have grown in popularity and interest, even in business fields. The use of virtual elements in socialisation, marketing, exploration, and education, among others, is becoming more common every day. Given these advances, it is essential to understand how CNNs have been used across the various fields of XR over the past decade.

Given the strong focus of CNNs on image and video processing, their application in the field of XR is particularly promising. While other deep learning methods are effective in various domains, CNNs are distinguished by their exceptional efficiency in processing visual data, making them potentially suitable for enhancing immersive experiences in XR environments.

In this paper, we explore the use of CNNs in the field of VR/AR/MR/XR. Based on the problem described above, this article aims to answer the following research question: "What is the use of CNNs in VR, AR, MR, and XR as a whole?". To answer this question, a systematic literature review has been performed.

1.2 Structure

This work is divided into the following sections: in Sect. 2, the key concepts related to our work are described. In Sect. 3, the related studies that partially address the research question within the field of study are analysed. In Sect. 4, we outline the methodological process followed in the systematic review. In Sect. 5, the different classifications proposed in this research, their features, and distributions are depicted. In Sect. 6, we discuss the study’s scope, its limitations, and potential scenarios not considered in the research. Finally, in Sect. 7, we conclude with the most relevant findings.

2 Background

In this section, we address the concepts regarding immersive technologies such as VR, AR, MR, and XR. We also go in depth into Deep Learning, a branch of machine learning, and specifically Convolutional Neural Networks.

2.1 Clarifying immersive technologies: virtual, augmented, mixed, and extended realities

Immersive technologies such as VR, AR, MR, and XR each offer unique ways of blending digital elements with real-world experiences, but they do so in distinctly different manners. VR provides a completely immersive experience where users interact within a computer-generated environment using specialised devices, completely isolating them from the real world to create a strong sense of presence (Burdea and Coiffet 2017). In contrast, AR overlays digital objects onto the real world, allowing users to see and interact with a blend of real and virtual elements without losing touch with their surroundings (Azuma 1997). MR goes a step further by not just overlaying but also anchoring virtual objects to the real world, enabling interactions that affect both the digital and physical elements simultaneously (Milgram and Kishino 1994). Finally, XR encompasses all the aforementioned technologies, serving as an umbrella term that covers the entire spectrum of interactions between the digital and physical realms (Billinghurst and Nebeling 2021; Gugenheimer et al. 2022). Each technology plays a unique role in how users perceive and interact with the blend of real and virtual environments, from complete immersion in VR to seamless integration in MR, and a holistic approach in XR.

Fig. 1 Illustration of the virtual technologies analysed. Own elaboration

2.1.1 Virtual reality

VR is defined by Burdea and Coiffet in their book "Virtual Reality Technology" (Burdea and Coiffet 2017) as an immersive technology that allows users to interact with a three-dimensional environment generated by a computer. Through specialised input and output devices, such as virtual reality glasses, VR immerses the user in a virtual environment, allowing them to experience and manipulate this environment as if it were real. This total immersion can generate a feeling of presence, that is, the perception of truly being inside the virtual environment.

2.1.2 Augmented reality

AR is defined as a variant of virtual environments that complements reality by allowing the user to see the real world with overlaid or combined virtual objects (Azuma 1997). Unlike VR, which fully immerses the user in a synthetic environment, AR maintains the user’s connection to the real world, creating an experience in which virtual and real objects seem to coexist in the same space (Azuma 1997). AR is thus positioned between the entirely synthetic and the entirely real (Milgram and Kishino 1994).

2.1.3 Mixed reality

MR is a concept that lies within the reality-virtuality continuum proposed by Milgram and Kishino (1994), combining elements of real and virtual environments to create enriched, unique experiences. This technology allows the simultaneous communication and integration of digital objects and elements with the real environment, creating a space in which both real and virtual objects coexist and influence each other (Milgram and Kishino 1994). In this way, MR enriches the user’s perception of and virtual integration with the surrounding world, providing a balance between reality and virtuality.

2.1.4 Extended reality

XR is an inclusive term that encompasses all immersive technologies that merge the physical world with the digital, such as VR, AR, and MR (Billinghurst and Nebeling 2021; Gugenheimer et al. 2022; Mhaidli and Schaub 2021; Ratclife et al. 2021). This proximity between the virtual and the real can be observed in Fig. 1. To simplify the classification, in this research, XR will be used to refer to those works that consider or use CNNs in VR, AR, and MR.

2.2 Deep learning and convolutional neural networks

DL, according to Lecun et al. (2015), is a branch of machine learning based on the concept of deep neural networks. It is characterised by the use of multiple layers of processing nodes, or "neurons", each of which can learn representations of data at different levels of abstraction. These representations allow deep neural networks to learn automatically from raw data, making them particularly effective for tasks such as voice recognition, computer vision, and natural language processing.

CNNs are part of the Deep Learning family and are also models used in image and video analysis. According to LeCun et al. (1998), they are a type of deep neural network especially effective in image processing tasks. CNNs are characterised by their ability to automatically learn hierarchical representations of data through the application of convolutional filters, allowing the detection of local features and patterns in images. Additionally, CNNs are translation-invariant, that is, they can recognise a feature anywhere in the image, regardless of its location. This makes them particularly useful for tasks such as object recognition and medical image analysis.

In the context of image processing, CNNs are distinguished by their specialised architecture. Designed specifically for handling visual data, CNNs are composed of three main types of layers: convolutional layers, pooling layers, and fully-connected layers. The combination and stacking of these layers form the architecture of a CNN. Fig. 2 illustrates a simplified CNN architecture for MNIST classification based on LeCun et al. (1998), O’Shea and Nash (2015).

Fig. 2 Illustration of an example of CNN architecture. Based on LeCun et al. (1998), O’Shea and Nash (2015)

The convolutional layers apply filters over the input image to detect important features, such as edges and textures. The filters move through the image and calculate how they align with different parts of the image, creating a feature map that highlights the detected elements (O’Shea and Nash 2015; Li et al. 2022).

The pooling layers reduce the size of the feature maps generated by the convolutional layers. This is done by selecting the maximum or average value of small regions of the map, helping to reduce the amount of data and making the network more efficient and resilient to small variations in the image (O’Shea and Nash 2015; Li et al. 2022).

The fully-connected layers take the reduced feature map and transform it into a one-dimensional vector. Each neuron in these layers is connected to all neurons in the previous layer. This final vector is used to perform classification, assigning a score to each possible class to decide which category the image belongs to (O’Shea and Nash 2015; Li et al. 2022).
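To make the composition of these three layer types concrete, the following minimal PyTorch sketch (our own illustrative example, not the exact network of Fig. 2) stacks two convolution/pooling stages and two fully-connected layers for 28 × 28 grayscale inputs such as MNIST; all layer sizes are assumptions.

```python
# A minimal sketch of the three layer types described above (illustrative sizes).
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Convolutional + pooling layers: detect local features such as edges.
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1),   # 1x28x28 -> 16x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 16x28x28 -> 16x14x14
            nn.Conv2d(16, 32, kernel_size=3, padding=1),  # 16x14x14 -> 32x14x14
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x14x14 -> 32x7x7
        )
        # Fully-connected layers: flatten the feature maps and score each class.
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

model = SimpleCNN()
scores = model(torch.randn(1, 1, 28, 28))  # one dummy image -> 10 class scores
```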

2.3 Terminology in the classification of interaction

In this section, we broadly define some concepts used as categories in the proposed classification of CNN use in XR technologies with respect to interaction. One of these concepts is the Brain-Computer Interface (BCI), a system that allows direct communication and control between the human brain and electronic devices by translating neural signals into commands that can be interpreted by a computer (Lotte 2014). These systems often use electroencephalogram (EEG) signals, a non-invasive technique that measures and records brain electrical activity using electrodes placed on the scalp. This technique provides a real-time representation of human brain activity through different frequency waves (Chartier et al. 2009).

Human-computer interaction (HCI) is a multidisciplinary field that studies the design and use of computer technology, focusing on the interfaces between people (users) and computers. The main goal of HCI is to improve the interaction between users and computer systems, making it more efficient and effective (Rogers 2005). Gestures are defined as physical movements or postures performed with some part of the body, primarily the hands, to convey information or interact with the environment. In computing, gesture recognition refers to the ability of computer systems to interpret these human gestures accurately and effectively, allowing for a more intuitive and natural interaction between humans and machines (Mitra and Acharya 2007).

A common term in this field is avatar, referring to a user’s digital representation, often used in online environments such as video games, internet forums, and virtual communities. These avatars can vary in realism and can significantly affect interaction and communication in virtual environments, influencing aspects like information disclosure, nonverbal communication, emotional recognition, and the sense of presence in dyadic interaction (Bailenson et al. 2006). Foveated refers to rendering techniques inspired by the structure of the human eye. These techniques focus on the detailed representation of a limited portion of vision, known as the fovea, where visual acuity is highest, and represent peripheral areas with less detail (Guenter et al. 2012).

Long Short-Term Memory (LSTM) networks are a special class of recurrent neural networks (RNN) designed to avoid the vanishing gradient problem in learning long sequences. LSTM networks use gates, units that can allow or block information based on certain criteria, to regulate the information that is retained or discarded over time in the network’s memory. This gate structure helps LSTMs retain relevant long-term information while forgetting non-essential details, making them highly effective for many tasks involving sequential data, such as natural language processing, time series analysis, and voice recognition (Hochreiter and Schmidhuber 1997). Moreover, they can be combined with CNNs in the CNN-LSTM model (Tara et al. 2015).
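As a rough illustration of how a CNN and an LSTM can be combined in such a CNN-LSTM model, the following PyTorch sketch (a generic example under assumed input sizes, not the model of Tara et al. 2015) encodes each frame of a short clip with a small CNN and feeds the resulting feature sequence to an LSTM for classification, e.g. of a gesture.

```python
# A hedged CNN-LSTM sketch: clips of 16 frames of 64x64 grayscale images;
# all sizes and the number of classes are illustrative assumptions.
import torch
import torch.nn as nn

class CNNLSTM(nn.Module):
    def __init__(self, num_classes: int = 5, hidden: int = 64):
        super().__init__()
        # Per-frame CNN encoder extracts spatial features.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),   # 64 -> 16
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(4),  # 16 -> 4
            nn.Flatten(),                                                # 16*4*4 = 256
        )
        # LSTM integrates the per-frame features over time.
        self.lstm = nn.LSTM(input_size=256, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clips.shape
        feats = self.encoder(clips.reshape(b * t, c, h, w)).reshape(b, t, -1)
        out, _ = self.lstm(feats)
        return self.head(out[:, -1])  # classify from the last time step

model = CNNLSTM()
scores = model(torch.randn(2, 16, 1, 64, 64))  # 2 clips -> 5 class scores each
```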

2.4 Terminology in the classification of executing

Photogrammetry is a scientific method used to obtain information about the physical properties of objects and the environment through the interpretation and analysis of photographs and light patterns (Cleveland and Wartman 2006). This technique is widely used to create maps or estimate the geometry of a scene. According to Chen et al. (2016), semantic segmentation refers to the task of assigning a semantic label to each pixel in an image. This computer vision technique aims to understand images at a more detailed level and provide precise descriptions of scenes. Object Detection (OD) refers to the process of identifying specific instances of objects from certain classes, such as people, items, animals, etc., within an image or video. This task not only involves classifying which object is present but also locating that object in space, usually represented by a bounding box around the object. OD differs from semantic segmentation in that the latter labels each pixel in an image with the object class to which it belongs, rather than just placing a bounding box around the entire object. Therefore, semantic segmentation provides a much more detailed and precise understanding of the image content, but it is also computationally more demanding (Girshick et al. 2013).
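The contrast between the two outputs can be sketched with off-the-shelf pretrained models; the snippet below uses torchvision's Faster R-CNN and DeepLabV3 as illustrative stand-ins (our choice, not models from the cited works): the detector returns boxes, labels, and scores, whereas the segmenter returns one class label per pixel.

```python
# A hedged sketch contrasting object detection and semantic segmentation
# with generic pretrained torchvision models (illustrative choices only).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.segmentation import deeplabv3_resnet50

image = torch.rand(3, 480, 640)  # a dummy RGB image with values in [0, 1]

# Object detection: bounding boxes, class labels, and confidence scores.
detector = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
with torch.no_grad():
    detections = detector([image])[0]      # dict with 'boxes', 'labels', 'scores'

# Semantic segmentation: one class label per pixel.
segmenter = deeplabv3_resnet50(weights="DEFAULT").eval()
with torch.no_grad():
    logits = segmenter(image.unsqueeze(0))["out"]  # [1, num_classes, H, W]
    label_map = logits.argmax(dim=1)               # [1, H, W] per-pixel labels
```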

2.5 Terminology in the classification of creation

Simultaneous Localisation And Mapping (SLAM) is a technique from robotics that allows an autonomous device to generate a map of its environment and orient itself within it simultaneously. The SLAM technique combines data from various sensors to calculate the device’s position and update the environment’s map in real time (Cadena et al. 2016). In addition, Visual SLAM (VSLAM) is a variant of SLAM that uses cameras as the primary sensor for localisation and mapping. VSLAM can employ one or more cameras to capture images of the environment and use key visual features to calculate the device’s position and create or update the map (Cadena et al. 2016). Some of these applications use the corresponding coordinates in the 3D domain to represent a point; the grouping of these points in the reconstruction is known as a point cloud. A notable example of key point extraction techniques based on CNNs is SuperPoint, which was a leading contender in the CVPR2020 image-matching challenge (Liu et al. 2022) (Table 1).

3 Related work

Table 1 Comparison of different SLRs and their focus areas compared to the proposed SLR on the use of CNNs in AR, VR, MR, and XR

Several literature reviews have been published that consider CNNs in the field of AR, VR, MR, or XR. Most related works focus on a specific approach to the use of convolutional networks, where authors envision the application of their research in some area of XR. In Minaee et al. (2021), the authors conducted a meticulous examination of image segmentation algorithms based on deep learning techniques. These algorithms were grouped according to their architectural categories, aiming to perform a quantitative evaluation of their performance (Chitty-Venkata and Somani 2022). In Chen et al. (2021), the focus was on understanding fluid simulation in the field of ML. They also referenced papers that use CNNs as a data-based Eulerian solver to improve their simulations. The authors were enthusiastic about potential applications in VR fields, with significant potential for use in AR/VR. Jiang et al. (2022) reviewed animal pose estimation. They noted that while some research still creates silhouettes manually, semantic image segmentation using CNNs makes it possible to separate both the silhouette and the pose of objects, humans, and even animals (Cheng et al. 2023). Another related work was a review in which 2D instance segmentation types and various CNN architectures are discussed (Gu et al. 2022).

In other research work, the performance of some key disparity estimation methods in images was compared, evaluating depth, an essential element in scene reconstruction tasks, AR, and 3D modelling (Laga et al. 2020; Xu et al. 2019). The latter could be implemented using point clouds and CNNs (Zhang et al. 2020; Wang et al. 2020; Zhang et al. 2023), which require subsequent processing to be used in virtual reality graphic domains (Fahim et al. 2021). The identification of characteristic points (Zhang et al. 2020; Laga et al. 2020) and scene recognition for indoor navigation or movement were already part of the role of CNNs (Khan et al. 2022). As seen in SLAM, a convolutional network was trained instead of building an explicit map, thus maintaining a constant size without experiencing linear growth problems (Yan and Zha 2019).

In Wei et al. (2019), a method for stitching images or videos through semantic matching was proposed, in which the salient features were extracted using CNNs. It was possible to construct a panoramic view from a sequence of images, and this panorama can be displayed in VR (Wei et al. 2019). In Hamza and Dao (2022), the authors reviewed techniques for sensors that preserve user information, showcasing a use case employing CNNs, again within the context of AR/VR.

Estimating depth, as well as determining the position and detection of objects, are relevant topics in AR, VR, MR, or XR. In Sahin et al. (2020), a categorisation based on mathematical models was established for object pose estimation methods (Yan and Zha 2019; Zou et al. 2023) and for human motion and pose (Desmarais et al. 2021), both in 2D (Wang et al. 2021; Han et al. 2017) and 3D (Gamra and Akhloufi 2021; Ji et al. 2020; Chen et al. 2020), concluding that the most accurate techniques utilise various architectures. Another related work is Hoque et al. (2021); thanks to this review, different architectures used for object detection, together with their pose estimation, respective viewpoint (Zhang and Fei 2019), and edges (Han and Zhao 2019), could be identified. Scale is a factor in such recognitions, with CNNs being among the techniques for detecting small or tiny objects (Tong and Wu 2022).

Quality evaluation of panoramic content is essential in virtualisation and immersive applications. Another related work is a literature review that analysed the techniques and measurements of \(360^\circ\) content. In this work, related articles were found that obtain a quality score from video patch samples using the 3D-CNN architecture (Ullah et al. 2022).

To effectively apply CNNs within extended reality technologies, understanding their use is paramount. This becomes especially relevant when considering the need to equip both students and professionals for evolving professions by offering a flexible virtual environment for real-world case studies, leveraging AR/VR-enhanced learning spaces (Bermejo et al. 2023).

HCI is another related field, through gesture recognition in AR/VR domains (Yuanyuan et al. 2021), for example hand poses (Huang et al. 2021), BCIs (Hu et al. 2022; Miltiadous et al. 2023), or the study of the foveated visual approach, where eye movement was predicted using CNNs on AR/VR platforms (Mohanto et al. 2022). In D’Orazio et al. (2016), the importance of depth in the HCI field was emphasised and, within one category of methods, the use of CNNs in hand gesture recognition was found. In Yang et al. (2019), different algorithms used for gesture recognition in virtual reality are presented; it can also be seen that the combination of CNNs and LSTMs is useful in this field.

4 Methodology of the systematic literature review

The goal of this paper is to present the last thirteen years of research on the use of CNNs in XR. A systematic literature review (SLR) is necessary to answer our research question. Through an SLR, we can identify, evaluate, and select relevant information in a rigorous and structured manner. To this end, this section presents a detailed description of the literature selection process, based on Kitchenham et al. (2009), covering the years 2010 to 2023.

4.1 Databases

From the wide spectrum of current scientific databases, we selected: the Institute of Electrical and Electronics Engineers (IEEE) Xplore digital library, Science Direct, and Clarivate Web of Science (WoS).

4.2 Keywords

In order to perform the most suitable query in the previous databases, we defined the following keywords: “CNN”, “Convolutional Neural Network”, “Virtual Reality”, “VR”, “Augmented Reality”, “AR”, “Mixed Reality”, “MR”, “Extended Reality”, “XR” and “Metaverse”. The keyword “Metaverse” is included as a conceptual extension of Virtual Reality, given the recent renaming of one of the most used social platforms in the world.

Considering the specified keywords and the research questions to be answered, we built the following logical expression: (“Convolutional Neural Network” OR “CNN”) AND (“Virtual Reality” OR “Augmented Reality” OR “Mixed Reality” OR “Extended Reality” OR “AR/VR” OR “MR/XR” OR “Metaverse”). The logical expression was evaluated from 2010 to 2023, narrowing the results through specific filters for each database. As a result, we obtained a total of 844 available research papers.

4.3 Search and selection process

The queries performed in each database are described in this section, together with the filters used and the adaptations and limitations of each search engine (see Fig. 3).

Fig. 3 Illustration of the SLR construction process. Own elaboration

In the IEEE database, the previous keywords were used, adding a feature of the advanced search engine (the use of asterisks to perform an extended search over different variants of a term). The logical expression used in IEEE was: ((“Convolutional Neural Network*” OR “CNN*”) AND (“Virtual* Reality*” OR “Augmented* Reality” OR “Mixed* Reality” OR “Extended* Reality” OR “AR/VR” OR “VR/AR” OR “Metaver*” OR “MR/XR”)). Additionally, the results were filtered to Journals and Magazines.

In ScienceDirect, the asterisks are removed since its search engine does not have this feature. The logical expression used was: (“Convolutional Neural Network” OR “CNN”) AND (“Virtual Reality” OR “Augmented Reality” OR “Mixed Reality” OR “Extended Reality” OR “AR/VR” OR “VR/AR” OR “Metaverse”). In this case, it was necessary to remove prepositions since this engine does not allow the use of more than 8 logical conditionals. To address potential gaps, an additional search was conducted, ensuring consistency in search logic. For both searches, the applied filters were: years between 2010 and 2023; type of article: Review articles, Research articles; journals ranked in quartiles Q1 and Q2; subject area: Computer Science.

In the WoS database, the search query was adjusted by prefixing it with ‘ALL’, denoting a search across all editions. The search logic employed was: ALL=((“Convolutional Neural Network” OR “CNN”) AND (“Virtual Reality” OR “Augmented Reality” OR “Mixed Reality” OR “Extended Reality” OR “AR/VR” OR “VR/AR” OR “MR/XR” OR “Metaverse”)). The scope was restricted to publications from 2010 to 2023. The applied filters include: type of document: Review articles, Research articles; journals ranked in quartiles Q1 and Q2.

Fig. 4 Articles available in the selected topics and databases. Own elaboration

To ensure rigorous selection and systematic analysis of the literature, we refined the description of our inclusion and exclusion criteria. Initially, our search yielded 844 works, as seen in intersection "D" of Fig. 4. These articles were then subjected to a detailed evaluation process, beginning with an initial screening based on the title, abstract, introduction, and conclusions.

Inclusion criteria for the review were as follows: articles had to specifically discuss the application or potential application of technologies within extended reality environments, involve or discuss the implementation of convolutional neural networks, and be published in journals and magazines to ensure credibility and scholarly value. Exclusion criteria included articles that did not specifically address extended reality technologies or did not incorporate CNNs, as well as incomplete or preliminary studies based solely on abstracts without full studies. This meticulous categorisation and selection process led to the final inclusion of 348 articles, which were further organised by categories relevant to the review’s focus.

4.4 Bibliometric results

The 348 selected works were analysed bibliometrically, obtaining useful information such as the year of publication (see Table 2). Since 2018 is a turning point in this body of research, we divided the time span into two blocks: before and after 2018. As shown in Fig. 5, there was an increase in research output leading up to 2020, and 2022 is the year in which the most works were published. Between 2019 and 2020, the number of works doubled compared to the previous period; although an increase is also evident in 2021 and 2022, the slope is not as steep as that of the period between 2018 and 2019. Additionally, a drastic decrease in articles is evident in 2023, for which data was collected only until August.

Fig. 5 Number of publications per year. Own elaboration

Table 2 Summary of references by year

5 The use of CNNs in extended reality environments

Fig. 6 Categories and subcategories. Own elaboration

In this section, we address the classification of research works focusing on crucial aspects that underline our goal of understanding the use of CNNs in AR/VR/MR/XR. We examine three fundamental characteristics: interaction, execution, and creation, to effectively cover the different research areas and applications of these technologies.

The interaction category focuses on how users interact with virtual environments and how CNNs can enhance these interactions, including gesture recognition, EEG, gaze tracking, and the integration of virtual objects into the physical environment, allowing for more immersive and intuitive experiences. The execution category focuses on the implementation and performance of XR applications, including performance optimisation, context adaptation, and personalisation of the experience based on user and environment features, resulting in smoother, more efficient, and adaptive XR experiences, enhancing user satisfaction and the adoption of these technologies in various fields and applications. Finally, the creation category refers to the generation of content and visual elements for XR environments using CNNs, addressing texture synthesis, 3D model generation, character animation, and the creation of realistic virtual environments, facilitating and accelerating the content production process and allowing for more detailed and customised results.

Table 3 displays the classification of the selected articles, separated by rows into interaction, execution, and creation sections and by columns into the different XR technologies. It is important to clarify that the classification between interaction, execution, and creation is mutually exclusive, while that between the VR, AR, MR, and XR technologies is non-exclusive. Additionally, Fig. 7 shows the number of articles per year in the mentioned classifications, and Fig. 8 displays their percentage distribution, with the execution category having the highest concentration. In order to detail the articles included in interaction, execution, and creation, an internal classification of each of these sections is proposed, using the characteristics of the different articles and maintaining the XR classification. Fig. 6 shows the different categories into which the articles found were classified. The following sections provide details about each of these categories.

Fig. 7 Publications by classification and by year. Own elaboration

Table 3 Classification of using CNNs in VR/AR/MR/XR
Fig. 8 Percentage distribution of articles in creation, execution, and interaction. Own elaboration

Fig. 8 is divided into three main categories. Creation, represented by the colour orange, accounts for 24% of the total number of articles. Execution, represented by yellow, is the largest category, representing 43% of the total articles reviewed; it includes research focused on the implementation and performance of XR applications, covering topics such as performance optimisation; identification, segmentation, and tracking of objects or contextual elements; object motion and pose; and personalisation of the experience based on user and environment characteristics. Finally, the interaction category, represented by the colour green, makes up 33% of the total number of articles.

5.1 Interaction classification

In the context of CNNs in XR, the category ’interaction’ refers to the dynamic exchange between users and virtual environments, where CNNs process and respond to user inputs, enabling engagement with XR content. Considering this fundamental category, a classification into sub-areas is proposed. Table 4 lists the articles related to the interaction between users and AR/VR/MR/XR environments using CNNs, where the highest concentration is in the HCI and gestures classification (Fig. 9).

Table 4 Interaction classification
Fig. 9 Percentage distribution of XR technologies interaction subcategories. Own elaboration

Fig. 9 shows that 41% of the articles focus on human-computer interaction and gestures, followed by hand detection with 15% and gaze tracking with 11%. The subcategories of body recognition, use of sensors, facial recognition, emotions, tools, and audio represent between 5% and 7% each, while brain-computer interface is the lowest with 2%. This indicates that most research prioritises improving the interface and natural interaction with XR technologies, with significant interest in methods such as hand detection and eye tracking.

5.1.1 Brain-computer interface

In the BCI subcategory, articles that use brain signals in the field of study of this research were classified. In this subcategory, it was found that CNNs are used to decode movement, as in the case of Achanccaray and Hayashibe (2020), which decodes EEG signals to obtain hand movement. In this field, it is also possible to detect the level of acrophobia in VR environments using EEG and a ResNet network; in Wang et al. (2021), the levels of a user’s acrophobia in VR are classified.
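As a generic illustration of how a CNN can classify EEG windows, the following hedged sketch applies one-dimensional convolutions along the time axis of the multi-channel signal; the channel count, window length, and class count are assumptions, not the decoders of the works above.

```python
# A hedged, generic sketch of CNN-based EEG window classification.
import torch
import torch.nn as nn

class EEGWindowCNN(nn.Module):
    def __init__(self, channels: int = 32, num_classes: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, 16, kernel_size=7, padding=3),  # temporal filtering
            nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),   # summarise over time
            nn.Flatten(),
            nn.Linear(32, num_classes),
        )

    def forward(self, eeg: torch.Tensor) -> torch.Tensor:
        return self.net(eeg)  # eeg: [batch, channels, samples]

scores = EEGWindowCNN()(torch.randn(4, 32, 256))  # 4 windows -> 2 class scores each
```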

5.1.2 HCI and gestures

One of the most relevant categories in the interaction classification is the HCI and gestures subcategory. Here, we find articles that use CNNs to improve the interface and immersion, such as "Deep Spherical Harmonics Light" (Mohammed et al. 2019). This technique estimates the lighting configuration of the real environment from a single RGB image, without prior knowledge of the scene. It helps solve inconsistencies in lighting that could occur when integrating the user’s hands and virtual elements, enhancing the user’s immersion in the MR application. In Sen et al. (2022), Sagayam et al. (2022), He et al. (2022), Ge et al. (2022), Yao and Qiu (2021), Alam et al. (2022), Jia et al. (2021), Bose and Kumar (2021), Kang et al. (2020), Polap et al. (2020), Xu et al. (2020), Li and Fan (2020), the focus is on hand gestures, improving interaction processes in virtual environments or using heat sensors to capture hand movements and enable writing (Kim et al. 2017). Additionally, in Liu and Pan (2022), a system is proposed that enhances the perception of the virtual world, capturing and transmitting user movement information to a supervised model. This system quickly collects and analyses data from multiple sources and provides feedback to the mobile device about the ongoing activity. This allows the user to move and make simple gestures in a specially designed space, among other things distinguishing the player’s change of direction in real time. In Karambakhsh et al. (2019), AR and CNNs are used to enhance teaching in medical education, specifically in anatomy teaching. The authors propose a CNN for gesture recognition, which interprets human gestures as specific instructions. They use AR technology to simulate scenarios where students can learn anatomy using HoloLens, instead of real specimens that can be difficult to obtain. Their approach is not only more accurate, but it also has more potential to add new gestures, highlighting different models.

5.1.3 Foveated and ocular visualisation

In the subcategory of foveated and ocular visualisation, the focus is on eye tracking, attention identification, and gaze recognition. In Hu et al. (2020), the authors propose training a CNN to directly segment the full elliptical structures of the eye. They argue that this framework is more robust against obstructions than previous ones and offers higher performance in tracking the pupil and iris. Compared to using standard segmentation of parts of the eye, the authors claim that their method improves the detection rate of the pupil and iris centres by at least 10% and 24% respectively (within a margin of error of two pixels). Segmentation, biometric verification (Boutros et al. 2020), attention (Dai et al. 2021; Lee et al. 2019), and gaze tracking (Hu et al. 2021) or its prediction (Yuan et al. 2017) are all part of the use of CNNs in AR and VR. In Dai et al. (2022), the authors propose a new gaze-tracking method based on the fusion of binocular features and a CNN. This method integrates both local (LBSAM) and global (GBSAM) binocular spatial attention mechanisms into the network model to improve accuracy. LBSAM is a mechanism used to distinguish the importance of different regions of the two eyes, aiming to enhance gaze-tracking accuracy. GBSAM spatially weighs the head, face, and image angle to include global variables in gaze tracking. Additionally, the authors validated this method using the GazeCapture database. In Olszewski et al. (2016), CNNs are used to map images of a user’s mouth region to the parameters controlling a digital avatar in VR. However, the authors demonstrate that their approach can also track expressions in the user’s eye region using an internal infrared camera, allowing for complete facial tracking. In Kothari et al. (2021), they perform facial recognition considering AR filters applied to the face and use distance algorithms and different facial landmarks, especially the eyes. In Huong et al. (2022), the authors develop a machine learning model for assessing the quality of 360\(^{\circ }\) images in VR, utilising foveated technologies. Foveated technologies leverage the focusing feature of the human eye, which significantly reduces the data required for transmission and the computational complexity of rendering. This is important because 360\(^{\circ }\) images, a key component of VR systems, are typically large and therefore require efficient transmission and rendering solutions.
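In its simplest form, appearance-based gaze estimation of this kind can be framed as regression from an eye crop to a gaze target; the following hedged sketch (generic, with assumed input sizes, and not the LBSAM/GBSAM method described above) illustrates the idea.

```python
# A hedged, generic sketch: regress a 2D gaze target from a grayscale eye crop.
import torch
import torch.nn as nn

class GazeCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 36x60 -> 18x30
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 18x30 -> 9x15
            nn.Flatten(),
        )
        self.regressor = nn.Linear(32 * 9 * 15, 2)  # (x, y) gaze target

    def forward(self, eye_crop: torch.Tensor) -> torch.Tensor:
        return self.regressor(self.backbone(eye_crop))

gaze_xy = GazeCNN()(torch.randn(1, 1, 36, 60))  # one eye crop -> one (x, y) estimate
```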

5.1.4 Hand detection

In the subcategory of hand detection, the articles that use CNNs in VR, AR, MR, or XR to identify and track hands are grouped. It is important to note the overlap between this category and the gestures one. Hand detection allows for a more fluid and natural interaction between the user and the virtual, augmented, or mixed environment. For example, a user might interact with virtual objects using hand gestures (Wang et al. 2022; Cofer et al. 2022; Yuan et al. 2021; Achanccaray and Hayashibe 2020; Aly and Aly 2020; Zhou et al. 2019; Emporio et al. 2022; Caputo et al. 2021; Li and Zhao 2021; Li et al. 2020; Zhang and Chi 2020; Malik et al. 2019; Gomez-Donoso et al. 2019; Marques et al. 2018; Li et al. 2022; Sen et al. 2022; He et al. 2022; Ge et al. 2022; Yao and Qiu 2021; Polap et al. 2020; Xu et al. 2020; Mohammed et al. 2019), instead of relying on traditional controllers or input devices. Estimating the pose of the hand with precision and realism is crucial in VR/AR; Deng et al. (2021) stands out as it manages to determine the hand pose with great precision and realism using CNNs, and additionally generates a dataset for future applications. In Liu et al. (2021), the authors propose a 3D micro-gesture recognition system based on a 3D holoscopic image sensor. Due to the lack of 3D holoscopic datasets, they created a comprehensive 3D holoscopic micro-gesture database (HoMG) that is used to develop a robust 3D micro-gesture recognition method. They improve performance using multiple viewpoints from a single holoscopic image and apply a CNN model with an attention-based residual block to each hand viewpoint image. In Wu et al. (2020), a system is proposed that uses depth images to accurately estimate a hand’s position in 3D space for XR applications. The CNN is designed with a skeleton difference loss function that allows it to effectively learn the physical constraints of a hand. This enables accurate prediction of hand joint positions even in challenging environments or with occlusions.
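A common baseline for this kind of system is to regress 3D joint coordinates directly from a cropped depth map; the following hedged sketch (a generic formulation with an ordinary MSE loss as a placeholder, not the skeleton-difference loss of Wu et al. 2020) illustrates the setup.

```python
# A hedged sketch of hand pose estimation as direct joint regression from depth.
import torch
import torch.nn as nn

class HandPoseCNN(nn.Module):
    def __init__(self, num_joints: int = 21):
        super().__init__()
        self.num_joints = num_joints
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 96 -> 48
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 48 -> 24
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.regressor = nn.Linear(64, num_joints * 3)  # xyz per joint

    def forward(self, depth: torch.Tensor) -> torch.Tensor:
        joints = self.regressor(self.features(depth))
        return joints.view(-1, self.num_joints, 3)      # [batch, joints, xyz]

depth_map = torch.randn(1, 1, 96, 96)                    # a cropped hand depth image
pred = HandPoseCNN()(depth_map)                           # predicted 3D joint positions
loss = nn.functional.mse_loss(pred, torch.zeros_like(pred))  # placeholder target
```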

5.1.5 Facial recognition

In the facial recognition subcategory, segmentation (Liang et al. 2019) is fundamental, as well as recognition for the identification of expressions (Albraikan et al. 2022; Alashhab et al. 2022), emotions, similarities, and structures or features. In Zhou and Feng (2022), they propose M3SPCANet, a CNN that uses multiscale PCA filters to obtain facial feature maps and enhance detection and recognition. In Refat et al. (2022), they combine the conventional steps of face detection, face alignment, feature extraction, and similarity computation into a single cohesive process. Face detection also has applications in animation for VR: in Olszewski et al. (2016), recordings and expressions generated by artists are used to simulate facial expressions.

5.1.6 Body analysis for interactivity

The body analysis for interactivity subcategory focuses on the identification of the body, its actions, and the representations these can generate for interaction, such as the localisation of anthropometric landmarks from 3D body scans (Kozbial et al. 2020). In Kozbial et al. (2020), they propose an approach to detect and provide real-time feedback on body movement errors in physical training conducted in virtual reality. In Zherdev et al. (2021), they estimate the 3D human pose from a single image. Instead of relying on a single complex estimator, they use multiple partial hypotheses. With this approach, they select several joint groups from a human joint model and estimate the 3D pose of each joint group separately using CNNs. These pose estimates are then combined to obtain the final 3D pose.

5.1.7 Use of sensors

In the use of sensors subcategory, the selected works use sensor signals to improve interaction. In Wang et al. (2022), a wrist sensor is used and different movement features are combined to classify 12 gestures, highlighting its utility in detecting gestures based on a wrist sensor (for example, through a smartwatch). In Smith et al. (2021), a safety control framework for human-robot collaboration is proposed, and both image analysis and the robot’s sensors are used to monitor safety in a digital twin using CNNs. In Brandolt Baldissera and Vargas (2020), the accuracy of manual operations in VR training systems is evaluated using gloves with sensors that collect precise data on hand movements. The authors assert that datasets from multiple sensors are seldom leveraged to assess actions in VR training systems. In Tao et al. (2020), worker activity recognition is carried out in AR in intelligent manufacturing systems using sensors and vision. The authors developed a dataset of worker activity, which includes six common activities in assembly tasks: grabbing a tool/part, hammering a nail, using an electric screwdriver, resting the arms, turning a screwdriver, and using a wrench. In Liu et al. (2020), the authors propose a mirror therapy system based on the recognition of multichannel signal patterns and mobile augmented reality. The overall accuracy of the SVM is 93.07%, while that of the CNN reaches up to 97.8%. These results suggest that machine learning techniques can play a crucial role in enhancing the effectiveness of mirror therapy for the rehabilitation of post-stroke hemiparesis, an alteration in the functioning of one side of the body.

5.1.8 Auditory processing

In the auditory processing subcategory, articles identify different sound inputs or outputs to enhance interaction. This includes voice recognition or the spatial application of sound in immersive technologies. In Siyaev and Jo (2021), the authors propose an MR-based solution for education and training in aircraft maintenance, specifically the Boeing 737, using smart glasses. The solution includes a deep learning-based voice interaction module that allows trainee engineers to control virtual assets and workflows through voice commands, freeing their hands for other tasks. In Lopez Ibanez et al. (2021), researchers developed a head gesture recognition system to identify and interpret human emotions, specifically fear. This system is designed to be integrated into an adaptive music system (LitSens) in virtual reality applications, aiming to improve immersion and virtual presence. In Amjad et al. (2022), emotions are identified from audio signals using CNNs and LSTMs. These models learned audio representations from deep segments of Mel spectrograms, which are visual representations of the spectral energy distribution of an audio signal. The models were trained using raw voice data, as well as Mel spectrogram segments from different perspectives (middle, left, right, and side), allowing the models to learn both local and global features of the audio signals. In Ling et al. (2020), UltraGesture is presented, a system for perceiving and recognising finger movements based on ultrasounds. UltraGesture uses the Channel Impulse Response (CIR), which detects and recognises small finger movements through sound.
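A typical front end for such audio pipelines converts the raw waveform to a Mel spectrogram before feeding it to a CNN; the following hedged sketch (with an assumed sample rate, number of Mel bands, and a small illustrative classifier, not the CNN-LSTM models of Amjad et al. 2022) shows the idea.

```python
# A hedged sketch: raw audio -> Mel spectrogram -> small CNN emotion classifier.
import torch
import torch.nn as nn
import torchaudio

waveform = torch.randn(1, 16000 * 3)  # 3 s of dummy mono audio at 16 kHz
mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=64)(waveform)
mel = mel.log1p().unsqueeze(0)        # [1, 1, 64, frames], log-compressed

classifier = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 4),                 # e.g., neutral / happy / sad / fearful
)
emotion_scores = classifier(mel)
```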

5.1.9 Emotional analysis

In the emotional analysis subcategory, all the articles that manage to predict or categorise feelings are included, as well as those addressing disorders or certain diseases. In Liang et al. (2019), a CNN is proposed that efficiently classifies human facial expressions of emotions. Thus, from the face, different emotions can be classified (Albraikan et al. 2022; Xiao et al. 2020; Song et al. 2020; Izountar et al. 2022; Martínez et al. 2021; Chirra et al. 2021), or emotions can be recognised through speech (Mustaqeem Sajjad and Kwon 2020; Amjad et al. 2022). Consequently, in Chiu et al. (2020) an innovative and easy-to-use emotionally aware virtual assistant for university campus environments is presented, which improves efficiency in semantic interpretation and emotion identification through voice. In Hedman et al. (2022), the authors predict the degree of motion-induced dizziness when viewing a 360\(^{\circ }\) stereoscopic video. The method is based on the use of three-dimensional convolutional neural networks (3D-CNN) and considers the movement of the user’s eye as a new feature, in addition to the speed and depth-of-motion characteristics of the video.
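Because such methods convolve over time as well as space, a 3D-CNN operates on short clips rather than single frames; the hedged sketch below (with assumed clip length, resolution, and layer sizes, producing a single regression score) illustrates the general structure.

```python
# A hedged sketch of a 3D-CNN that scores short video clips (illustrative sizes).
import torch
import torch.nn as nn

model3d = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=3, padding=1),   # convolve over time and space
    nn.ReLU(),
    nn.MaxPool3d(2),
    nn.Conv3d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(32, 1),                             # a single regression score
)

clip = torch.randn(1, 3, 16, 112, 112)  # [batch, channels, frames, height, width]
score = model3d(clip)
```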

5.1.10 Interaction tools

In the interaction tools subcategory, articles are grouped that indirectly affect or use interaction for a specific purpose. In Liu (2021), the focus is on developing a motion detection system using a CNN for the recognition of high-difficulty sports movements. In this case, the CNN is used to extract images and perform computational preprocessing for the recognition of each human motion image. In Mukhopadhyay et al. (2022), the authors propose an application of detection technologies to improve workplace safety by measuring social distance between individuals. Using data visualisation techniques based on intermittent layers, heat maps, CNNs, and digital twins, they recognise proximities between people in VR. In Vaughan and Gabrys (2020), methods of scoring for personalised haptic virtual reality medical training simulators are proposed and evaluated. A novel approach called dynamic time warp multivariate prototypes (DTW-MP) was proposed, and the VR data was classified into experience level categories: Beginner, Intermediate, or Expert. Various algorithms were used for classification, achieving different levels of accuracy: dynamic time warp with one nearest neighbour (DTW-1NN) achieved 60%, SoftDTW nearest centroid classification achieved 77.5%, and a deep learning approach based on ResNet achieved 85%, demonstrating the use of CNNs for VR interaction assessment. In Tai et al. (2021), an approach is proposed to improve the accuracy of image-guided lung biopsy in patients with COVID-19 through the combination of AR, custom haptic surgical tools, and CNNs. The authors propose a personalised surgical navigation system that can adapt to the individual needs of each patient. The system’s performance was evaluated by 24 thoracic surgeons through objective and subjective tests. The results show that the use of AR with the deep learning model outperforms existing navigation techniques as of the year 2021, offering significantly better performance. In Liu et al. (2022), user interaction is used to predict potential hacker intrusions, employing an intrusion detection simulation training system with model identification based on CNNs and LSTMs in VR.

Thanks to the analysis of the articles included in Table 4, it can be said that the use of CNNs plays a crucial role in improving interaction in virtual, augmented, mixed, and extended reality applications. They allow users to communicate and interact with their environment in a more natural, fluid, and accessible way. Additionally, they enable the identification and real-time tracking of people or parts of their body in these environments, enhancing the interaction between the user and their virtual or augmented surroundings. Furthermore, they are used to track the position of the user’s eyes and determine where they are looking. This allows VR and AR applications to provide information and interaction options based on the user’s gaze direction, enhancing the interaction experience. They can also be used to recognise the user’s voice or translate language or gestures in real time. This allows for a more accessible and natural interaction with VR, AR, MR, and XR applications, especially for users with hearing or speech disabilities. Finally, CNNs are used in behaviour analysis and emotion recognition in virtual environments. This allows applications to adapt to user emotions and reactions, even detecting dizziness or fear of heights, leading to adapted experiences and improved interaction.

5.2 Execution

Table 5 shows the articles that address the use of CNNs in the execution of AR/VR/MR/XR applications and systems. Research in this category focuses on topics such as performance optimisation; identification, segmentation, and tracking of objects or context elements; the movement and pose of objects; and the adaptation of the experience based on user and environment characteristics. As seen in Fig. 10, 50% of the articles in this category are distributed between recognition and segmentation.

Table 5 Execution classification
Fig. 10 Percentage distribution of execution subcategories in XR technologies. Own elaboration

Fig. 10 shows that 29% of the articles focus on advanced object and scene recognition, followed by 21% on semantic and image segmentation. Object detection and tracking takes up 18%, while optimisation accounts for 13%. Motion tracking and recognition accounts for 10%, pose estimation and tracking for 6%, and 360 degree content for 3%. This indicates a primary focus on improving the analysis and understanding of scenes and objects to optimise functionality in XR technologies.

5.2.1 360 degree content processing and analysis

The subcategory of 360 degree content processing and analysis refers to the full or perigonal angle; here, the articles that address immersive content are found. In Yang et al. (2018), the creation of a 360\(^{\circ }\) dataset is presented, and the study proposes an end-to-end 3D convolutional neural network to rate the quality of VR videos without needing a reference VR video. This method can extract spatio-temporal features, eliminating the need for manually designed features. In Irfan and Munsif (2022), the quality of panoramic videos and stereoscopic panoramic videos is evaluated. The proposed method combines spherical CNNs and non-local neural networks, enabling effective extraction of complex spatio-temporal information from the panoramic video. In Adhuran et al. (2022), researchers propose a new 360\(^{\circ }\) video encoding framework that leverages user-observed viewing information to reduce pixel redundancy in 360\(^{\circ }\) videos. By optimising areas with greater attention in 360\(^{\circ }\) content, the experience in VR is improved (Zhu et al. 2021; Su and Grauman 2022). In Su and Grauman (2021), visual recognition in spherical images produced by 360\(^{\circ }\) cameras is addressed. The authors propose learning a Spherical Convolution Network (SphConv) that translates a flat CNN to the equirectangular projection of 360\(^{\circ }\) images. Given an original CNN for perspective images as input, SphConv learns to reproduce the outputs of the flat filter on 360\(^{\circ }\) data, considering the variable distortion effects on the viewing sphere. Additionally, the authors present a Faster R-CNN model based on SphConv and demonstrate that it is possible to use a spherical object detector without any object annotations in 360\(^{\circ }\) images.

5.2.2 Object detection and tracking

In the subcategory of object detection and tracking, the primary focus is on identifying and following objects. In Hoang et al. (2019), a rapid object detection approach based on deep learning is proposed to identify and recognise types of obstacles on the road, as well as to interpret and predict complex traffic situations. A single CNN directly predicts regions of interest and class probabilities from full images in a single evaluation. In Huang and Yan (2022), the use of MR headset-mounted cameras for artificial vision-based detection of objects related to diet activities is proposed, followed by the display of real-time visual interventions to support the choice of healthy foods. In Thiel et al. (2022), the focus is on the classification and retrieval of 3D objects. The authors propose a novel method that combines a Global Point Signature Plus (GPSPlus) with a CNN. GPSPlus is a novel descriptor that can capture more shape information from a 3D object for a single 2D view. First, the original 3D model is converted into a coloured one using GPSPlus. Next, the 2D projection obtained from this 3D coloured model is stored in a 32 \(\times\) 32 \(\times\) 3 matrix, which is used as input data for a Deep Residual Network with a unique CNN structure. In You et al. (2021), the use of object detection algorithms to enrich visitors’ experiences at a cultural site is proposed, through the implementation of these algorithms on wearable devices, such as smart glasses. In Yu et al. (2022), the authors proposed a solution for 3D object localisation with mobile devices. The proposed method combines a CNN model for 2D object detection with AR technologies to recognise objects in the environment and determine their coordinates in the real world. In Lai et al. (2020), a methodology is introduced to address visual target tracking tasks. It involves using a CNN capable of classifying a set of patches based on how well the target is centred or framed. To counteract potential interferences, the network is fed patches located around the object detected in the previous frame, and of different sizes, to account for potential scale changes and detect the shift. One of the most recent studies is Zhang et al. (2022), in which machine learning and synthetically generated data are used to create object tracking configurations exclusively from this data. The data is highly optimised for training a CNN, providing reliable and robust results in real-world applications while using only simple RGB cameras.

5.2.3 Advanced object and scene recognition

In the subcategory of advanced object and scene recognition, an activity is addressed after the detection of the object or its possible movement. This is the section where the object, its attributes, or their representation in a context are recognised, allowing the entire scenario to be recognised (Tang et al. 2022). In Polap et al. (2017), the relationship between the scene and the associated objects in everyday activities is explored from an egocentric vision perspective, that is, from the observer’s point of view. The authors argue that daily activities tend to occur in prototypical scenes that share many visual features, regardless of who recorded the video or where, thus recognising the context. In Su and Grauman (2021), a lightweight but powerful CNN called the Efficient Feature Reconstruction Network (EFRNet) is presented for real-time scene recognition. The central idea breaks the process down into two stages: (i) bottom-up dictionary learning/encoding and (ii) top-down feature reconstruction. In Bai et al. (2021), RoadNet-RT is introduced, a lightweight, high-speed CNN architecture specifically designed for road recognition and segmentation. This architecture has been optimised for autonomous driving and virtual reality, where real-time processing speed is essential. In Nambu et al. (2022), MR is used to offer immersive and enriched experiences to the visitors of the Taxila Museum in Pakistan. It recognises museum artefacts using DL in real time and retrieves supporting multimedia information for visitors. To provide the user with the exact content, CNNs are applied to correctly recognise the artefacts. In Ko and Lee (2020), the authors propose an approach to improve 3D object recognition using a view-weighted CNN (VWN); the hypothesis is that different projections of the same 3D object have distinct discriminatory characteristics, and therefore some images are more meaningful than others for object recognition. In Zhang et al. (2020), researchers propose the establishment of a CNN model for classifying geometric figures. This is achieved by optimising hyperparameters through random search, which improves the image recognition and classification process.

5.2.4 Segmentation

In the subcategory of segmentation, semantic segmentation and image segmentation are incorporated. In the XR context, segmentation allows for the identification and classification of different objects and elements in a scene, giving the system a deeper understanding of the environment. This process is essential to augment the scene with relevant information, allowing a more natural and precise interaction between the user and the virtual or augmented environment. In Yi et al. (2019), researchers train a neural network for semantic segmentation in different scenarios to process images taking into account the category of the scene, yielding better results. In Zadeh et al. (2020), this study uses deep learning-based semantic segmentation in gynaecology to detect and locate a structure in an image at the pixel level, providing augmented information to the specialist. In Zou et al. (2020), the focus is on semantic segmentation and depth completion, central tasks for scene understanding that are vital in AR/VR applications. In Zhang et al. (2020), the authors propose a curriculum-based learning approach that seeks to bridge the domain gap in the semantic segmentation of urban scenes. This approach is based on first solving simpler tasks that allow inferring important properties about the target domain. Specifically, they learn global and local label distributions in the images, referencing superpixels. Once these properties are inferred, a segmentation network is trained and its predictions in the target domain are adjusted to fit the inferred properties. In Han et al. (2020), the focus is on challenges and solutions in the semantic understanding of 3D environments. The authors propose a sparse convolution scheme based on fragments to reuse neighbouring points within each spatially organised fragment. By implementing semantic and geometric 3D reconstruction simultaneously on a portable tablet device, the authors demonstrate a foundational platform for AR applications. In Tanzi et al. (2021), different architectures for semantic segmentation are compared in the task of identifying and locating a catheter in medical images and its possible application in AR. In Al-Sabbag et al. (2022), the authors propose a method for visual inspection of structural defects using an XR device. They allow for the interactive detection and quantification of defects using this device and image segmentation, which can overlay graphical information in a real environment. In Liu et al. (2022), a morphological diagnostic system is established for the detection of bone marrow cells based on a Faster R-CNN object detection model. The system is trained to perform pixel-level image segmentation and automatically detects bone marrow cells and determines their types. The information is visualised and integrated into a microscope with AR. In Zhang and Aliaga (2022), the authors use the CNN, RFCNet, which uses regularisation, fusion, and completeness to improve urban segmentation accuracy. Their approach uses urban structures as they often present regular patterns, resulting in improved accuracy. In Jurado-Rodríguez et al. (2022), an automatic procedure is proposed for the generation and semantic segmentation of 3D cars obtained from UAV-based photogrammetric image processing. The authors recognise that deep learning architectures, coupled with the wide availability of image datasets, offer new opportunities for 3D model segmentation. One of the most notable papers on segmentation is Hu and Gong (2022). 
The authors present a Lightweight Asymmetric Refinement Fusion Network (LARFNet), designed to perform real-time semantic segmentation on mobile devices. LARFNet is a CNN with an asymmetric encoder-decoder structure, incorporating a depth-separable asymmetric interaction module (DSAI) in the encoder and a bilateral pyramid attention module (BPPA) along with a multi-stage refinement fusion module (MRF) in the decoder. These modules facilitate effective information extraction and the refinement and fusion of feature maps, respectively. In Park et al. (2020), instance segmentation through Mask R-CNN, coupled with markerless AR, is used to overlay the 3D spatial mapping of a real object on its surrounding environment. This 3D spatial information with instance segmentation provides 3D guidance and navigation, assisting users in identifying and understanding physical objects as they move through the physical environment.
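
As an illustration of how instance segmentation can feed an AR overlay pipeline, the following minimal sketch obtains per-instance masks for a frame with a pretrained Mask R-CNN from torchvision; it is a generic example, not the system of Park et al. (2020).

```python
import torch
import torchvision

# Minimal sketch, assuming a pretrained Mask R-CNN from torchvision; the
# confidence threshold and input frame are placeholder values.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

frame = torch.rand(3, 480, 640)  # placeholder RGB frame with values in [0, 1]

with torch.no_grad():
    prediction = model([frame])[0]

# Keep only confident instances; each mask is a soft per-pixel probability map.
keep = prediction["scores"] > 0.7
masks = prediction["masks"][keep] > 0.5   # binary masks, shape (N, 1, H, W)
labels = prediction["labels"][keep]
print(f"{masks.shape[0]} instances detected")
```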

5.2.5 Optimisation

In the subcategory of optimisation, articles on performance, architecture, or peripherals are grouped. Advances in execution have allowed for smoother, more efficient, and adaptive AR/VR/MR/XR experiences, enhancing user performance and satisfaction and increasingly enabling access to these technologies in everyday settings. In Qu et al. (2023), a CNN-based quality assessment metric called LFACon is proposed, which surpasses state-of-the-art metrics and achieves the best performance for most distortion types with reduced computational time. In Spagnolo et al. (2023), a custom Fast Super-Resolution CNN (FSRCNN) accelerator is used, capable of processing up to 214 ultra-high-definition frames/s with lower energy consumption and without compromising perceptual visual quality, achieving a 55% energy reduction and a performance rate 14 times higher than other devices. In Pinkham et al. (2023), the authors introduce a near-sensor processor architecture, ANSA, that supports flexible processing schemes and data flows to maintain high efficiency for dynamic CNN workloads on devices, improving energy efficiency. In Luo et al. (2023) and Sun et al. (2023), the authors focus on finding CNN configurations for point cloud classification that increase accuracy while reducing computational cost and execution time.
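
For readers unfamiliar with the FSRCNN family, the following sketch outlines a generic FSRCNN-style super-resolution network (feature extraction, shrinking, mapping, expanding, and transposed-convolution upscaling); it is a simplified re-implementation with assumed layer sizes, not the hardware-accelerated design of Spagnolo et al. (2023).

```python
import torch
import torch.nn as nn

# Minimal sketch of an FSRCNN-style super-resolution CNN; d, s and m are
# assumed values, not those of any specific accelerated implementation.
class FSRCNNSketch(nn.Module):
    def __init__(self, scale=2, d=56, s=12, m=4):
        super().__init__()
        layers = [nn.Conv2d(1, d, 5, padding=2), nn.PReLU(d),     # feature extraction
                  nn.Conv2d(d, s, 1), nn.PReLU(s)]                # shrinking
        for _ in range(m):                                        # non-linear mapping
            layers += [nn.Conv2d(s, s, 3, padding=1), nn.PReLU(s)]
        layers += [nn.Conv2d(s, d, 1), nn.PReLU(d)]               # expanding
        self.body = nn.Sequential(*layers)
        self.upscale = nn.ConvTranspose2d(d, 1, 9, stride=scale,  # upscaling
                                          padding=4, output_padding=scale - 1)

    def forward(self, x):
        return self.upscale(self.body(x))

lr_frame = torch.randn(1, 1, 270, 480)        # low-resolution luminance channel
sr_frame = FSRCNNSketch(scale=2)(lr_frame)    # 2x upscaled output
print(sr_frame.shape)                          # torch.Size([1, 1, 540, 960])
```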

5.2.6 Movement estimation

The subcategory of movement estimation frames articles that focus on tracking and movement recognition. In Dai et al. (2020), a "Motion Estimation Network" (MEN), part of an object tracking system consisting of a motion model and an observation model, is used. This network seeks the most probable locations of the target and generates candidate positions in addition to the target’s previous position, optimising movement estimation by producing a small number of candidates close to the two possible positions. These candidates are fed into a Siamese network trained to identify the most probable candidate, and each candidate is compared to an adaptive buffer that is updated according to a predefined condition. To adapt to changes in the target’s appearance, a weighting CNN adaptively assigns weights to the final similarity scores of the Siamese network using sequence-specific information, allowing movement to be identified and predicted (Zeng et al. 2021). In Shariati et al. (2020), the authors introduce a solution to estimate ego-motion in a way that preserves user privacy. Ego-motion is a key concept in robotic systems and augmented and virtual reality applications, referring to the ability to estimate one’s own movement from sensory perception. They use a very low-resolution monocular camera and a CNN, named SRFNet, to recover ego-motion. The results of this study indicate that ego-motion can be robustly recovered from very low-resolution images when camera orientations and metric scales are obtained from inertial sensors and merged with the estimated translations.
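
To make the Siamese scoring step more concrete, the sketch below compares a target template against candidate patches with a shared CNN backbone and cosine similarity; it is a simplified placeholder, not the MEN or the weighting CNN of Dai et al. (2020).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of Siamese-style candidate scoring for tracking; the backbone
# and patch sizes are assumed placeholders.
class SiameseScorer(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )

    def forward(self, template, candidates):
        # Shared weights: the same backbone embeds the template and candidates.
        t = F.normalize(self.backbone(template), dim=1)      # (1, 32)
        c = F.normalize(self.backbone(candidates), dim=1)    # (N, 32)
        return (c @ t.t()).squeeze(1)                        # cosine similarity per candidate

scorer = SiameseScorer()
template = torch.randn(1, 3, 64, 64)     # appearance of the tracked target
candidates = torch.randn(8, 3, 64, 64)   # patches proposed around the predicted motion
scores = scorer(template, candidates)
print("most probable candidate:", scores.argmax().item())
```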

5.2.7 Pose estimation techniques

The pose estimation techniques subcategory groups articles that work with pose, from orientation to position in XYZ coordinates and their degrees of freedom. In Kim and Lee (2019), the authors present a markerless localisation method for live broadcasts. This approach uses two CNNs to automatically detect the target object and estimate its initial 3D pose. As a result, the 3D model can be aligned on a global map without the need for manual intervention or markers.
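
As an illustration of pose regression, the sketch below shows a small CNN head that predicts a 3D translation and a unit quaternion from a detected object crop; it is a generic, assumed design, not the two-network system of Kim and Lee (2019).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a CNN pose-regression head; the backbone and crop size
# are assumed placeholders.
class PoseRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, 7)   # 3 translation + 4 quaternion components

    def forward(self, crop):
        out = self.head(self.features(crop))
        translation = out[:, :3]
        quaternion = F.normalize(out[:, 3:], dim=1)   # keep the rotation on the unit sphere
        return translation, quaternion

crop = torch.randn(1, 3, 128, 128)     # crop around the detected target object
t, q = PoseRegressor()(crop)
print(t.shape, q.shape)                # torch.Size([1, 3]) torch.Size([1, 4])
```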

Thanks to the analysis of the articles included in Table 5 under the category of execution, various uses of CNNs in AR/VR/MR/XR technologies are highlighted. Most of these works focus on aspects such as the identification and semantic segmentation of objects and context elements, as well as their tracking, which are fundamental for determining and enriching the scene. In addition, emphasis is placed on the analysis of the movement and pose of objects in both closed and open environments. This part of the research illustrates the role of CNNs in understanding and improving immersive reality experiences, emphasising their potential to provide richer, more intuitive experiences adapted to the user’s context and underscoring their importance in the development and advancement of extended reality technologies.

5.3 Creation classification

In Table 6, we can see the articles that address the use of CNNs in AR/VR/MR/XR from the point of view of creation. Key topics in this category include automatic 3D model generation, texture synthesis, character animation, and the creation of realistic virtual environments. As Fig. 11 shows, this classification has a more uniform distribution, although three-dimensional modelling and reconstruction stand out. CNNs have been used to speed up and simplify the content creation process while allowing for more detailed and personalised results.

Table 6 Creation classification

Advances in this area have driven the development of content creation tools and platforms for VR/AR/MR/XR, allowing more users and developers to access and create immersive experiences increasingly faster and of higher quality. Each of the subcategories in which the articles in Table 6 were catalogued is described below:

Fig. 11 Percentage distribution of subcategories of creation in XR technologies. Own elaboration

Fig. 11 shows that 26% of the articles focus on the creation of virtual environments, followed by 23% on the reconstruction of 3D scenes and objects. The generation of 3D models accounts for 14%, while mapping and lighting simulation take up 10% and 9%, respectively. Texture generation is the smallest subcategory, with 5%. This indicates that most of the research in the creation category is focused on developing realistic and detailed environments and reconstructions to improve functionality and user experience in XR technologies.

5.3.1 Texture

The texture subcategory groups articles associated with the generation and synthesis of textures. Texture generation is a crucial component in creating compelling and realistic AR/VR/MR/XR environments, and CNNs have been employed to model and generate realistic textures that adapt to the environment and virtual objects. The use of CNNs not only improves the visual quality and realism of textures but can also optimise application performance by allowing textures to be generated and rendered efficiently. In Liu et al. (2021), an improvement in texture generation for video synthesis is proposed, taking into account fine-scale details such as wrinkles in clothing that depend on posture. The method combines two convolutional neural networks: using posture information, the first CNN predicts a dynamic texture map containing temporally coherent high-frequency details, and the second CNN conditions the generation of the final video on the temporally coherent output of the first. In Rodriguez-Pardo et al. (2019), the detection and replication of repetitive texture patterns from a single image are proposed. This technology has significant implications in graphics pipelines, where repetitive texture patterns are a key tool for creating realistic visual representations. A relevant point is its ability to determine the minimum repeated pattern size in an image and replicate it so that the resulting image is as close as possible to the original.

5.3.2 Mapping

The mapping subcategory groups articles that use mapping as a component of XR. Mapping allows environment models to be built that can be used for the localisation and orientation of virtual objects. CNNs have proven effective in mapping, allowing for precise and robust registration of objects and the environment, which results in a better integration of virtual and physical elements and a more consistent and compelling experience for the user. In Yang et al. (2022), the authors introduce a CNN-based model called SDF-SLAM; this model can estimate the camera’s position in a broader indoor environment and can also perform depth estimation and semantic segmentation on monocular images, thereby constructing a comprehensive and precise three-dimensional map. In Liu and Miura (2021), the authors propose a fundamental technology for augmented reality, RDMO-SLAM, a real-time vSLAM that combines RDS-SLAM (Liu and Miura 2021) and Mask R-CNN. RDMO-SLAM estimates the speed of each feature point and uses this information as a constraint to minimise the influence of dynamic objects on tracking, reducing error and increasing precision in simultaneous localisation and mapping. In Su and Yu (2022), CNNs are used to enhance deep image reconstruction by working with dense three-channel colour images (red, green, and blue), focusing on the transformation of multi-layer image-invariant features.
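
To illustrate how segmentation can suppress dynamic content in SLAM, the sketch below discards feature points that fall inside the mask of a segmented dynamic object before tracking; this is an assumed simplification of the general idea exploited by semantic SLAM systems such as RDMO-SLAM, not their actual implementation.

```python
import numpy as np

# Minimal sketch (assumed simplification): drop feature points that fall inside
# the mask of a dynamic object (e.g. a person segmented by Mask R-CNN) before
# they are used for camera tracking.
def filter_dynamic_points(keypoints_xy, dynamic_mask):
    """keypoints_xy: (N, 2) array of (x, y) pixel coordinates.
    dynamic_mask: (H, W) boolean array, True where a dynamic object was segmented."""
    xs = keypoints_xy[:, 0].astype(int)
    ys = keypoints_xy[:, 1].astype(int)
    keep = ~dynamic_mask[ys, xs]          # keep points outside dynamic regions
    return keypoints_xy[keep]

mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 200:400] = True             # a segmented moving object, for example
points = np.random.rand(500, 2) * [640, 480]
static_points = filter_dynamic_points(points, mask)
print(f"{len(static_points)} of {len(points)} points kept for tracking")
```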

5.3.3 Reconstruction

The reconstruction subcategory groups articles framed around the reconstruction of environments or objects. Reconstruction allows virtual replicas to be created from real-world data. CNNs have been successfully applied in reconstruction, enabling the generation of precise and detailed 3D models that can be used in a variety of applications, from VR content creation to augmented reality for the construction industry. In Bi et al. (2020), CNNs are used to extract information about the materials, geometry, and lighting of an object from a single RGB image and reconstruct its appearance. In Song et al. (2022), ACINR-MVSNet is introduced, a framework for multi-view stereo reconstruction that features group adaptive correlation and implicit neural enhancement to refine the depth map and the reconstruction guided by a corresponding reference image, achieving the recovery of finer details. In Manni et al. (2021), an Android application is proposed that tracks the phone’s position relative to the world, captures an RGB image for each exposed object, and estimates the scene’s depth, together with a server program that classifies the captured objects, retrieves the corresponding 3D models from a database, and estimates their position, rotation, and scale in AR. In addition, the process of joining images (image stitching or image mosaicking) can be considered part of reconstruction, as it creates 360\(^{\circ }\) mosaics in AR/VR. Despite its relevance, maintaining homogeneity between the input image sequences during stitching is a significant challenge. In Chilukuri et al. (2021), the authors propose a methodology for image stitching, called the left-right stitching unit (L,r-Stitch), which handles multiple non-homogeneous image sequences to generate a homogeneous panoramic view. L,r-Stitch consists of a CNN named l,r-PanoED; its encoder extracts semantically rich feature maps from the inputs to perform the stitching in a broad panoramic domain, while the decoder reconstructs the output panoramic view from the feature maps.

5.3.4 Environment

In the environment subcategory, articles utilise CNNs to understand and model the user’s environment, allowing for a smoother and more natural interaction by incorporating elements such as depth (Li et al. 2022) or scene lighting. By better understanding the environment, XR applications can offer more immersive and safe experiences. In Ye et al. (2022), there is an effort to improve the efficiency and accuracy of stereo-matching algorithms. The authors propose a stereo network that uses the prior consistency of local disparity to enhance the performance of real-time disparity estimation. An initial disparity estimate is calculated by a lightweight pyramid matching network, and two new modules are introduced: the Spatial Consistency Refinement (SCR) module and the Temporal Consistency Refinement (TCR) module. The SCR module uses high-confidence predictions from sparse neighbourhoods to refine the less reliable regions of disparity; it incorporates a single-layer dynamic local filter that adapts the propagation to the content, enhancing disparity quality without significantly increasing the computation and memory burden. The TCR module, on the other hand, refines the disparity estimation of consecutive frames based on disparity consistency over time. In Wu and Wang (2022), the Rich Global Feature Guided Network (RGFN) is proposed for monocular depth estimation using CNNs and Transformers (Vaswani et al. 2017).
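
As a structural illustration of monocular depth estimation, the sketch below defines a tiny encoder-decoder CNN that regresses one positive depth value per pixel; it only shows the input/output structure of such models and is not RGFN or the stereo network of Ye et al. (2022).

```python
import torch
import torch.nn as nn

# Minimal sketch of a monocular depth-estimation CNN; the layer sizes are
# assumed placeholders chosen only to keep the example small.
class TinyDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
            nn.Softplus(),   # depth is constrained to be positive
        )

    def forward(self, rgb):
        return self.decoder(self.encoder(rgb))   # (B, 1, H, W) depth map

rgb = torch.randn(1, 3, 192, 256)
depth = TinyDepthNet()(rgb)
print(depth.shape)   # torch.Size([1, 1, 192, 256])
```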

5.3.5 Light

In the light subcategory, articles that modify or interpret light are grouped. Modelling light is essential to creating compelling virtual and augmented reality experiences. CNNs have been used to estimate ambient illumination and apply it consistently to virtual objects, improving immersion by making virtual objects look as if they were present in the scene with the same lighting as the physical environment. In Chalmers et al. (2021), the authors present a method for ambient lighting reconstruction, essential for improving spatial presence in AR and MR applications. The illumination is encoded as a reflection map generated from a conventional photograph. The method uses a stacked CNN to predict roughness and light levels from a low dynamic range photograph with a limited field of view. The reflection maps are predicted with different degrees of roughness, corresponding to those of the virtual objects that are rendered, from the most diffuse to the brightest.

5.3.6 Three dimensional modelling

The three-dimensional modelling subcategory groups articles that reconstruct 3D models or use this domain for reconstruction. CNNs have proven effective in generating and manipulating 3D models, including generating 3D models from 2D images, synthesising realistic 3D models, and manipulating 3D models for animation or interaction. The ability of CNNs to work with 3D data has expanded the possibilities of XR, allowing the creation of richer and more realistic content. In Amara et al. (2022), the authors use a CNN, called O-Net, designed to automatically segment COVID-19-infected chest CT scans. This information feeds a 3D modelling process used in a virtual reality platform, COVIR, to visualise and manipulate 3D lungs and segmented COVID-19 lesions. In Ye et al. (2020), HAO-CNN is proposed, a network to reconstruct 3D hair models from a single image; it presents an advance in detail and direction, although its authors identify the reconstruction of curly hair and receding hair with occlusions as a remaining challenge.

5.3.7 Point cloud

In the point cloud subcategory, the articles are based on the generation of a point cloud to carry out the reconstruction of objects and the environment. Point clouds are a popular method for representing 3D data in the field of XR, and CNNs have been used to process and manipulate them, enabling object identification and classification, pose estimation, and 3D reconstruction. In Zhao et al. (2022), a context-aware deep network called PCUNet is presented; this network generates point clouds in a stepwise manner, from coarse to fine. PCUNet employs an encoder-decoder structure, with the encoder following a relation-shape convolutional neural network (RS-CNN) design and the decoder consisting of fully connected layers and two stacked decoder modules to predict complete point clouds, thus achieving more accurate models. In Jia et al. (2021), a point cloud is generated through geometric decomposition, using CNNs to learn the arrangement of compressed 2D points that can be propagated to 3D point cloud frames.
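
To illustrate the coarse stage of coarse-to-fine point cloud completion, the sketch below encodes a partial point cloud into a global feature and decodes a coarse completed point set with fully connected layers; it is a generic placeholder inspired by this family of methods, not PCUNet.

```python
import torch
import torch.nn as nn

# Minimal sketch of a point-cloud encoder-decoder: a PointNet-like encoder
# produces a global feature and a fully connected decoder predicts a coarse
# point set. All sizes are assumed placeholders.
class CoarsePointDecoder(nn.Module):
    def __init__(self, num_coarse=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 256, 1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, num_coarse * 3),
        )
        self.num_coarse = num_coarse

    def forward(self, points):                            # points: (B, N, 3), possibly partial
        feat = self.encoder(points.transpose(1, 2))       # (B, 256, N)
        global_feat = feat.max(dim=2).values              # permutation-invariant pooling
        coarse = self.decoder(global_feat)                # (B, num_coarse * 3)
        return coarse.view(-1, self.num_coarse, 3)        # coarse completed cloud

partial = torch.randn(2, 1024, 3)
coarse_cloud = CoarsePointDecoder()(partial)
print(coarse_cloud.shape)   # torch.Size([2, 256, 3])
```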

6 Discussion

This review considers articles based on the potential use of CNNs in the fields of XR. In this work, we reviewed the related works from 2010 to 2023, with the aim of capturing a transcendental stage in the development and evolution of both CNNs and AR/VR/MR/XR. This time frame is based on the rapid progression and swift adoption of these technologies observed between 2021 and 2022, characterised by significant advances in the capabilities of CNNs and their implementation in AR/VR/MR/XR; in 2023, research in this area saw a decline. Furthermore, it was anticipated that including the years 2010 and 2011 in the review would increase the number of publications before 2015, providing a more comprehensive and enriched overview of advancements in these areas over time. As a result of the search and classification in the selected databases, it became evident that no applicable articles were found in the domains of this research for the period between 2010 and 2015. It is possible that expanding the database selection might uncover articles from this period. However, several reasons might explain the lack of research in this area during those years: between 2010 and 2015, CNNs were still in an early stage of development and adoption. Although they were already being used and popularised in some computer vision applications (Schmidhuber 2015), their application in XR was not common. Additionally, during that period, XR devices were still under development, and the available hardware could not support complex CNN models with high resource demands. One of the most popular CNNs of this period took between 10 and 30 min to classify 8 million pixels using four GPUs (Ciresan et al. 2012) (Fig. 12).

Fig. 12 Publications by category by year. Own elaboration

It is possible that during those years researchers were more focused on other aspects of CNNs and XR technologies, such as algorithm optimisation and the development of specific applications, rather than exploring the intersection of both technologies. As CNNs and XR technologies evolved and matured, research in these fields became more specialised. Events such as Facebook’s transition to Meta or the crisis caused by SARS-CoV-2 might have drawn more attention to the intersection of these technologies.

Among the various areas of CNN use in different XR technologies, the "execution" category stands out as the one that has received the most attention in research and development, judging by the number of works in this area. The main reason could lie in the inherent nature of CNNs, designed for image analysis and processing. The execution category encompasses image processing, and these networks have demonstrated exceptional efficiency and precision in image and object recognition and classification. They excel in segmentation, pose, and movement estimation, which are essential for a smooth and consistent XR experience. By identifying and tracking objects and shapes in real time, CNNs enable XR systems to understand and respond to the user’s environment and actions, ensuring a more natural and precise immersion.

6.1 Impact of CNNs in AR, MR, and XR: interaction, creation, and execution

When describing the technologies of VR, AR, MR, and XR separately, we can say that in the case of VR, CNNs have been used in interaction to provide precise tracking of the user’s hands and body, as well as to recognise and process gestures and actions in real time. This enhances immersion and interaction in entirely virtual environments. In terms of creation, CNNs have been applied in generating detailed and realistic virtual environments from images or data captured from the real world, as well as in animating virtual characters and objects. Regarding execution, CNNs have contributed to optimising the performance of VR applications, such as reducing latency and increasing frame rate, which is crucial for an optimal user experience.

In AR, CNNs have been used in interaction for object recognition and understanding the physical environment, allowing precise and coherent integration of virtual elements in the real world. In creation, CNNs have been applied to generate realistic textures and 3D models from images or data captured from the environment. In execution, CNNs have aided in developing more efficient AR applications with lower resource consumption, enabling their use on mobile devices and systems with limited hardware.

In MR, which combines elements of AR and VR, CNNs have been used in interaction to enhance the fusion of virtual and physical environments, providing a more immersive and coherent experience. In creation, CNNs have been applied in generating 3D models and textures that adapt to the physical environment and lighting conditions. In execution, CNNs have been employed to optimise performance and real-time adaptation of MR applications, considering environment characteristics and hardware capabilities.

Regarding XR, which encompasses all the previous technologies, CNNs have been used in interaction to provide more natural and accessible interfaces, such as voice and gesture recognition, or adapting the information presented to the user based on their context. In creation, CNNs have been applied to generate personalised and adaptive content and experiences. In execution, CNNs have been employed to optimise and adapt XR applications to various devices and platforms, ensuring a smooth and efficient experience across a wide range of scenarios and applications.

The inherent limitations of this study include, first and foremost, temporality. While this study offers valuable insights into the current applications and potential of CNNs in XR technologies, it is important to note that both fields are rapidly advancing, and the relevance and accuracy of the information could be affected by this temporal limitation. Furthermore, the study relies on the consulted databases, which means there is a potential exclusion of relevant literature or research not indexed in them or contained in other databases not considered. Future research needs to be aware of these limitations when interpreting and applying the findings of this study.

The quality assessment process led to the exclusion of several studies that did not meet our criteria, mainly due to methodological shortcomings or because they did not explicitly mention that their research could be applied to XR technologies. This filtering ensured that our analysis was based on studies that provide reliable and unbiased data, thus reinforcing our conclusions about the effectiveness and applicability of CNNs in extended reality technologies.

6.2 Analysis of distributions

In Fig. 8, the execution category is the most representative, with 43%. This could be due to the fact that most uses of CNNs in XR occur in real time, while applications are running; segmentation and recognition are critical areas for ensuring that XR applications work efficiently and effectively on a variety of devices and in a variety of situations. Interaction takes up 33%, highlighting the importance of improving how users interact with XR technologies through gestures, eye tracking, and hand detection. Creation, although crucial, has a lower share at 24%, which might indicate less research into content generation and detailed virtual environments. In Fig. 9, the subcategory of Human-Computer Interaction is the most significant, with 41%. This could be because improving the naturalness and fluidity of interactions is fundamental to making XR technologies intuitive and accessible. In Fig. 10, recognition is the most representative subcategory, with 29%, which could be due to the need for an advanced understanding of the environment to improve the functionality of XR applications. Segmentation (21%) and object detection and tracking (18%) are also critical areas, facilitating the identification and classification of objects within a scene for more detailed and accurate interactions. In Fig. 11, the creation of virtual environments is the most prominent subcategory, with 26%, followed by the reconstruction of 3D scenes and objects, with 23%. Together, they account for about half of the creation research, which could be due to the crucial need to develop more realistic and detailed environments and to accurately represent the virtual environment. The distributions in these graphs may reflect research priorities and approaches to the use of CNNs in XR technologies: the execution category predominates due to the need to segment and recognise the real-time environment in XR applications; interaction focuses on developing more natural interfaces and interaction methods, with particular emphasis on gestures and human-computer interfaces due to their importance for accessibility and usability; and creation seeks to improve the quality and realism of virtual content, with a strong focus on environments and reconstruction to provide more detailed and accurate immersive experiences. These integrated efforts are essential to advance the functionality and user experience of extended reality technologies.

7 Conclusions

In the work carried out, the research question has been answered: in the context of XR, CNNs are used to enhance the user experience by processing and analysing images, movements, signals, and videos, even in real time. From this literature review, it can be identified that one of the uses of CNNs in the field of VR is to improve image quality and performance in generating virtual worlds by tracking the user’s gaze and optimising the points the user is focusing on. In AR, CNNs are used for object and marker recognition, allowing virtual information to be overlaid onto the real world. In XR, CNNs are used to combine real elements with virtual reality, creating more realistic immersive experiences. In MR, CNNs are used to recognise and track objects in the real world and overlay virtual content on them. In summary, CNNs are essential to providing a richer and more realistic immersive experience in these fields.

Table 7 Key takeaways and implications for different stakeholders

Additionally, it was concluded that CNNs can be classified within the VR/AR/MR/XR domain into three major groups: interaction, execution, and creation.

In terms of interaction, CNNs can be employed for recognising gestures and movements of users in VR, AR, and MR, enabling more natural and accurate interaction with virtual content. They are also utilised for tracking head and eye movements to adapt the perspective in real-time in VR.

In execution, CNNs are harnessed for object and marker recognition in the real world for AR and MR. They further provide a more realistic and smooth experience in VR, AR, and MR. This encompasses the use of CNNs to generate more detailed virtual worlds, enhance image quality, and reduce latency. They are also used for tracking objects and individuals in the real world for AR and MR, facilitating precise overlay of virtual content. Additionally, they are applied to optimise performance and image quality in VR, heighten efficiency in virtual world generation, and minimise loading time and latency.

Regarding creation, CNNs are employed for the reconstruction of objects and 3D scenes in VR, AR, and MR. This enables the crafting of more detailed and realistic virtual content and also allows for the accurate creation of three-dimensional models of the real world for augmented and extended reality applications. They are even used for real-time scene reconstruction captured by cameras to enhance the accuracy of virtual content overlay in AR and MR. Furthermore, they can be utilised for the reconstruction of objects and 3D scenes from captured images and videos, facilitating the creation of more precise immersive experiences.

The implications of our findings extend across various domains, influencing researchers, developers, and practitioners involved in the technology and application of extended reality. This detailed breakdown provides a clear roadmap for each stakeholder group to harness the potential of CNNs in their respective fields, thus enhancing the utility and impact of the review. Table 7 offers a detailed view of the use of CNNs in extended reality for each stakeholder, focusing on the proposed classification, specifically on interaction, execution, and creation.

The future work horizon of this research includes a detailed study of CNN architectures and the analysis of advanced Deep Learning methods in the field of XR, such as Generative Adversarial Networks (GANs) (Bau et al. 2018), among others. This exploration would enrich the study of artificial intelligence techniques in XR. It is vital to underline the importance of areas such as semantic segmentation, HCI, and 3D reconstruction in the application of CNNs in XR; their detailed investigation not only remains a priority but also opens doors to more specialised research in each subdomain. This work could be further extended by adding new databases and delving into the aforementioned areas.