1 Introduction

Virtual Reality (VR) is used in several domains, such as education, medicine, and architecture (Taxén and Naeve 2002; Kurumi and Morikawa 2016; Pourchera et al. 2018; Nazligul et al. 2017; Racz and Zilizi 2018; Hayes et al. 2018). Software engineering researchers have recently started to study the benefits of VR for software developers and researchers (Gulec et al. 2018; Zirkelbach et al. 2019; Akbulut et al. 2018; Sharma et al. 2018). Ruvimova et al. found that VR reduces unnecessary distractions faced by software engineers (Ruvimova et al. 2020). Researchers have also started to explore how programmers can communicate and collaborate in VR (Bierbaum et al. 2001; Vincur et al. 2017; Aleotti et al. 2004).

For programmers, one of the most challenging and time-consuming tasks is comprehending source code (Steinmacher et al. 2016; Al-Saiyd 2017; Bacher et al. 2017; Khomokhoana and Nel 2019). Current approaches to this challenge include visualizations that display code, enhanced syntax highlighting, and code comprehension through games (Merino et al. 2018; Oberhauser and Lecon 2017; Wiese et al. 2019). Programmers also find that collaboration leads to better code comprehension (Arisholm et al. 2007).

The COVID-19 pandemic has forced many programmers to work from home. Brynjolfsson et al. estimated that 50.2% of the workforce was remote during 2020 (Brynjolfsson et al. 2020). At the end of 2020, Upwork Inc. estimated that 26% of the total workforce would be working remotely in 2021 and that an estimated 22% would continue to work remotely in 2025 (Upwork Study Finds 22% of American Workforce Will Be Remote by 2025: Upwork 2020). Ralph et al. found that employee well-being and productivity are closely related and were adversely affected by the sudden shift to working from home (Ralph et al. 2020). They suggest that employers focus on improving employee well-being and the ergonomics of the home office space, for example by providing proper office furniture.

In recent years, there has been a renewed interest in digital spaces. A metaverse is a virtual world that allows multisensory interactions through VR and Augmented Reality (AR) (Mystakidis 2022). The Link Trainer, an analog flight simulator, is often considered the precursor to the present-day metaverse (Jeon 2015). Various metaverse platforms have already started offering virtual office spaces (Future of remote work? here’s the reality of working in the metaverse 2022). Facebook launched Horizon Workrooms, a business meeting space accessed through its Oculus series of VR headsets (Workrooms: VR for business meetings n.d.). These developments and the possibilities they open up heighten the need to understand the programmer’s experience during source code comprehension in VR. As the software development industry explores the use of VR for workspaces, academics should explore this space as well and provide research outcomes that further advance our understanding of this relatively new paradigm.

In this paper, we define the human experience of programmers during code comprehension in terms of how they experience frustration, mental effort, and physical effort. We study graduate student programmers comprehending source code in both a desktop computer setup and a virtual environment. We recruited 26 programmers to participate in this study. The program comprehension tasks asked programmers to read source code, provide the code output, and write a plain English summary of the code. In the study, 13 programmers worked in VR, and the other 13 used the desktop computer setup. We asked programmers in both groups to complete as many comprehension tasks as possible from 8 different code snippets in 40 minutes. We found that programmers faced many challenges comprehending source code in VR. Programmers in VR reported lower concentration levels and higher levels of physical demand and effort. However, we found that the VR programmers’ perceived and measured productivity were not significantly different from those of the desktop programmers.

This paper is an extension of the Program Comprehension in Virtual Reality paper (Dominic et al. 2020a). Beyond the details discussed in that publication, we added four more research questions, provided statistical analysis for eight more variables, applied the Benjamini-Hochberg adjustment to the p-values, and added Cohen’s d effect sizes and a Spearman correlation analysis. We expanded the paper from four pages to 24 pages, including more details about the experiment setup and an in-depth discussion of the results and the lessons learned while conducting this experiment. We discuss the results in the context of the added data and provide recommendations for future research in virtual environments for software engineering.

2 The Problem

We address a gap in the current program comprehension literature by exploring the productivity and perceived productivity of programmers comprehending source code in VR. As a community, we need to understand whether the benefits of using VR as a tool for software developers outweigh the risks (e.g., fatigue). To better understand the human experience of comprehending source code in VR, we first needed to expand upon our previous work on program comprehension in VR (Dominic et al. 2020a; b). It is important to note that as a community, we are starting to explore various uses of VR (Ruvimova et al. 2020; Merino et al. 2018; Elliott et al. 2015a). For example, Merino et al. demonstrated the use of VR for explaining software architecture in the form of city buildings (Merino et al. 2018). Ruvimova et al. studied how VR reduces distractions in the workplace (Ruvimova et al. 2020). Elliott et al. explored the affordances, possible applications, and challenges VR brings to software engineering (Elliott et al. 2015a).

This paper advances the SE research community’s understanding of program comprehension in VR. Our community has not taken the risks that other fields have taken for years. Many fields have found benefits in using VR for training and education (Sattar et al. 2020; Martin-Gutierrez et al. 2017; Wang et al. 2018; Hayes and Johnson 2019). Other fields have found benefits in using VR for collaboration (Hoppe et al. 2018; Guerin and Hager 2017). Unfortunately, there is still little work on software engineers working in VR. Even though we recognize program comprehension as one of the most significant factors in software engineering success, we have not spent the time to explore the potential of VR for enabling program comprehension.

Remote work has become increasingly prominent in the software engineering field during COVID-19 (O Connor et al. 2022). This means some programmers never enter their company’s physical office. Remote programmers use Slack, Microsoft Teams, and various other tools for collaboration. Unfortunately, this solution is problematic: remote programmers still find it challenging to collaborate with these tools (Miller et al. 2021). The more geographically dispersed a team becomes, the more negative the impact on team performance and software quality (Nguyen-Duc et al. 2015).

Various solutions, such as using agile practices, have been introduced in the past (Hossain et al. 2009). One solution is to use VR to bring programmers into the same environment for collaboration. If code comprehension and the human experience are similar or better in VR, it is worth further exploring for programmers. Software development teams could use VR as a tool for future training and onboarding of new programmers to software development teams. They could also use VR to facilitate pair programming sessions where an experienced programmer collaborates remotely with a new programmer to comprehend source code and program together.

3 Background and Related Work

In this section, we discuss the background and related work on program comprehension environments, virtual reality, VR in software engineering, and the human experience during comprehension.

3.1 Program comprehension environments

Many program comprehension environments have been built and explored (Williams et al. 2000). The focus has been on computer environments that display code in a unique fashion. Panas et al. proposed the software city metaphor, which visualizes software as a city for easier understanding of code structure (Panas et al. 2003). These visualizations were rendered on a 2D screen and were considered a high-cost activity at the time of their introduction in 2003. With the arrival of relatively affordable VR headsets such as the Oculus Rift, the research community replicated the software city metaphor in VR. Programmers found gesture-based translation, rotation, and selection of these software cities highly usable in virtual environments (Fittkau et al. 2015). Many researchers have developed visualizations of source code that help programmers quickly understand it at a high level (Kircher et al. 2001; Baheti et al. 2002). There are a few studies on the physical environment that programmers work in for comprehension (Schenk 2018). Johnson et al. studied distractions in the workplace and found that programmers who were distracted and had to context switch were less productive than their counterparts (Johnson et al. 2003). However, Grubert et al. found that VR environments limit potential distractions in the workplace (Grubert et al. 2018). These characteristics indicate that VR has potential as an effective code comprehension platform.

3.2 Virtual Reality

Virtual Reality has entered many domains outside of software engineering. Educators teach VR development to students as young as 7 to 14 years old and have found success in developing VR curricula for students of all ages (Häfner et al. 2013). Some children with autism spectrum disorder responded positively to the use of VR and were able to learn new things when information was presented in VR (Parsons and Cobb 2011). Hoffmann et al. designed a VR system that provides immersive tours of remote laboratories from a virtual theater (Hoffmann et al. 2016). This system gives users a different perspective and the opportunity to interact through natural means (their hands and gestures). Pellas et al. conducted a literature review of 41 research studies on VR for education published between 2009 and 2019 (Pellas et al. 2020). Several of them reported positive learning outcomes with the use of VR. Students in one study were able to fix real-world electrical circuits using the knowledge gained from VR training. However, some studies also reported problems with using VR, such as a 3D interface that is not always well suited to learning applications and limited collaboration capabilities.

VR is also used in robotics and manufacturing to simulate and evaluate the behavior of robotic systems and humans. Matsas et al. used VR to evaluate the behavior of humans interacting with robots to perform manufacturing tasks (Matsas and Vosniakos 2017). Many fields, such as medicine, architecture, and entertainment, employ VR technologies. Psychotherapists use VR to help clients overcome phobias, such as the fear of heights (Freeman et al. 2017). Hodges et al. demonstrated that participants exhibited acrophobia when exposed to increased heights in VR (Hodges et al. 1995). Medical schools use VR to train students on surgical procedures (Ullrich and Kuhlen 2012). Architects use VR to create and model buildings; they can interact with their clients, discuss designs, and preview a design before building (Bozgeyikli et al. 2016; Sampaio 2018; Johansson et al. 2014; Graham et al. 2018; Kreutzberg 2015). Businesses use VR for meetings with remote clients and employees (Boughzala et al. 2012). Pearlman et al. found examples of virtual worlds being used by companies such as IBM, Starwood, and Forterra for product development, and by the American Cancer Society for annual fund-raising events (Pearlman and Gates 2010). Not only has VR entered the workforce, but it has also continued to become a common household technology as more gaming companies release VR games (Herz and Rahe 2020; van Berlo et al. 2021).

3.3 Virtual Reality in Software Engineering

VR is a realistic interactive environment created by computer graphics. Many research areas, including gaming, training, education, and therapy, have implemented VR systems (Lohse et al. 2014; Laver et al. 2017; Bordegoni and Ferrise 2013). Software engineering researchers have recently started to explore the uses of VR for programmers. Elliott et al. display code as labeled piles on the floor and allow programmers to navigate the piles by grabbing them for closer inspection to better understand their functionality (Elliott et al. 2015b). Takala et al. demonstrated the use of virtual environments to help people learn how to program software systems (Takala 2014). This environment included “3D User Interface Building Blocks,” making it easier for programmers to interact with various peripherals.

Fittkau et al. researched “software cities” (Fittkau et al. 2015), which are 3D representations of the packages and classes in a software system. The city metaphor visualizes object-oriented code as a city: buildings represent classes, and districts represent packages. The visual artifacts of the city represent the software metrics. Romano et al. found that programmers using the city metaphor performed better during code comprehension (Romano et al. 2019a). Programmers using VR completed an average of 5.78 successful code comprehension tasks, whereas those using a desktop completed only 4.75.

3.4 Human Experience During Comprehension

The human experience during source code comprehension is complex (Al-Saiyd 2017; Khomokhoana and Nel 2019; Tvarozek et al. 2016). Research has found that programmers share similar experiences while comprehending source code, such as frustration (DeLine et al. 2005; Fritz et al. 2014). Researchers have used functional magnetic resonance imaging (fMRI) machines to better understand how the brain comprehends code (Floyd et al. 2017; Siegmund et al. 2014; Castelhano et al. 2019). Peitek et al. found that code comprehension activated brain regions related to working memory, attention, and language processing (Peitek et al. 2018). Siegmund et al. found lower activation of brain areas when semantic cues were used for code comprehension (Siegmund et al. 2017).

Although there are similarities in the way programmers comprehend source code, various factors such as the complexity of the code and programmer experience affect code comprehension. Cognitive complexity is a complexity measure developed to assess how understandable source code is. Various studies have shown a correlation between cognitive complexity and source code understandability during code comprehension (Campbell 2017; Muñoz Barón 2019). We examine programmers’ perceived complexity and understanding of the source code while assessing their perceived productivity during our study. These factors help us obtain a qualitative understanding of the effectiveness of virtual environments for code comprehension. LaToza et al. compared the code comprehension experience of expert and novice programmers (LaToza et al. 2007). They found that experts explained the root cause of an issue, whereas novices could only explain its symptoms. They also found that novices spent more time reading a larger set of methods and functions, whereas experts did not. Romano et al. found that using the city metaphor visualization in VR positively affected programmers’ feelings and emotions during code comprehension (Romano et al. 2019b).
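To make the cognitive complexity metric concrete, the annotated Java sketch below shows how it penalizes nesting. The method and the per-line increments follow our reading of the published rules (one increment per control-flow structure, plus one per level of nesting, plus one per flow-breaking jump) and are illustrative rather than authoritative.

    // Illustrative only: increments annotated per our reading of the
    // cognitive complexity rules (Campbell 2017).
    public class ComplexityExample {
        static int sumOfPrimes(int max) {
            int total = 0;
            OUT:
            for (int i = 2; i <= max; i++) {      // +1 (loop)
                for (int j = 2; j < i; j++) {     // +2 (loop, nested one level)
                    if (i % j == 0) {             // +3 (if, nested two levels)
                        continue OUT;             // +1 (flow-breaking labeled jump)
                    }
                }
                total += i;
            }
            return total;                         // cognitive complexity = 7
        }
    }

Deeply nested code like this accumulates increments quickly, matching the intuition that it is harder to comprehend.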

Concentration and attention are often considered largely identical (Petermann 2011). However, some experts consider concentration to be a component of the attention model (Schweizer 2006). Attention is the neurobiological concept of concentrating a subject’s mental power upon an object through careful observation and listening (Ma et al. 2002). Sörqvist et al. define concentration as a shield against distraction under increased task load (Sörqvist et al. 2016). Various studies have shown decreased performance in reading comprehension with increased load (Kiger 1989; Keating 2008). We use the overall task load as an indicator of concentration during code comprehension. Our research questions and data analysis examine how task load affects code comprehension in both groups of programmers.

4 Virtual Reality Study Design

In this section, we describe our research objective, research questions, methodology, surveys, data collection, subject applications, participants, statistical tests, equipment, threats to validity, and replication package.

4.1 Research Questions

Our research objective is to better understand the human experience and how programmers comprehend source code when working in a VR environment. Therefore, we ask the following Research Questions (RQs):

  • RQ1 To what degree does code comprehension in VR and desktop settings differ in terms of concentration?

  • RQ2 To what degree does code comprehension in VR and desktop settings differ in terms of mental demand?

  • RQ3 To what degree does code comprehension in VR and desktop settings differ in terms of frustration?

  • RQ4 To what degree does code comprehension in VR and desktop settings differ in terms of physical demand?

  • RQ5 To what degree does code comprehension in VR and desktop settings differ in terms of perceived productivity?

  • RQ6 To what degree does code comprehension in VR and desktop settings differ in terms of measured productivity?

The rationale behind RQ1 is that concentration is a weak predictor of reading comprehension; according to Wolfgramm et al., it may be a stronger predictor of program comprehension (Wolfgramm et al. 2016). Answering RQ2 will help us understand the difference in the intellectual effort necessary to comprehend code in VR versus on a desktop. The rationale behind RQ3 is that participants in the VR group must use extra equipment to perform code comprehension; we are interested in how this additional requirement affects participants’ frustration levels and code comprehension performance. RQ4 tests the difference in physical demand between the two programming environments. The rationale behind RQ5 is that comprehension influences productivity (Rastogi et al. 2017). By testing perceived productivity in different environments, we can explore the environment’s influence on comprehension to optimize working conditions for programmers. Finally, by answering RQ6, we can quantify the productivity differences between the two code comprehension environments and begin to understand the viability of using VR for code comprehension.

4.2 Methodology

We designed a research methodology based on related work in program comprehension (Good and Brna 2004; Rodeghero et al. 2014a; Abbes et al. 2011) to study how individual participants read and comprehend source code. We had two groups of participants. One group wore a VR headset for the entire study. The other group sat at a desk and used a desktop computer monitor to read the code and complete the comprehension tasks.

Participants either sat in front of a desktop computer setup with a monitor or sat down at a desk and put on a VR headset. We assigned participants to the desktop and VR groups in alternating order as they arrived for the study. We first introduced the VR participants to the VR system. The system consisted of a virtual environment with a Windows desktop mirrored in VR (see Fig. 1). The desktop mirror allowed participants to interact fully with a computer. For more information on the software used, see Section 4.10. We introduced the VR participants to the VR headset, the keyboard, and the mouse apparatus used for the experiment. We then assisted them with putting on the VR headset and made sure that they were physically comfortable. Participants were allowed to wear the VR headset over their corrective glasses when applicable. During the development of the VR system, we noticed that the virtual hands might not always overlap with the participant’s hands, an unfortunate artifact introduced by combining the HTC Vive and Leap Motion controllers. To remedy this, we designed keyboard alignment software that participants could use to align the virtual keyboard with their hands in VR. The software used only the arrow keys to move and rotate the keyboard, making it easy to use. Participants could bring up and close the tool using a space bar and “shift” key combination, and could toggle between moving and rotating the keyboard using the “x” key.

Fig. 1

The virtual reality environment used by the programmer to complete source code comprehension tasks. This figure shows the desktop mirror as the programmer comprehended source code

Once the participants were in the virtual environment, we introduced them to the keyboard alignment software and gave them as much time as they needed to make keyboard adjustments; this took no participant more than 5 minutes. The methodology described in the remainder of this section was the same for both groups of participants.
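For illustration, the sketch below captures the alignment controls just described. The actual tool was part of our VirtualDesk software; the class name, step sizes, and key-event plumbing here are hypothetical stand-ins rather than the real implementation.

    // Hypothetical sketch of the keyboard-alignment controls; the step
    // sizes and key handling are assumptions made for illustration.
    public class KeyboardAligner {
        private boolean visible = false;    // toggled with Space + Shift
        private boolean rotateMode = false; // toggled with the "x" key
        private double x, z;                // virtual keyboard position (meters)
        private double yawDegrees;          // virtual keyboard rotation

        private static final double MOVE_STEP = 0.005; // assumed step size
        private static final double ROTATE_STEP = 1.0; // assumed step size

        public void onKeyPressed(String key, boolean shiftDown) {
            if (key.equals("SPACE") && shiftDown) { // bring up or close the tool
                visible = !visible;
                return;
            }
            if (!visible) return;
            switch (key) {
                case "x":     rotateMode = !rotateMode; break; // move vs. rotate
                case "LEFT":  if (rotateMode) yawDegrees -= ROTATE_STEP; else x -= MOVE_STEP; break;
                case "RIGHT": if (rotateMode) yawDegrees += ROTATE_STEP; else x += MOVE_STEP; break;
                case "UP":    if (!rotateMode) z += MOVE_STEP; break;
                case "DOWN":  if (!rotateMode) z -= MOVE_STEP; break;
            }
            // The virtual keyboard model is then re-rendered at (x, z, yawDegrees).
        }
    }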

We provided the participants with 8 different Java projects, presented in a randomized order for each participant. Participants read one piece of code at a time: an image of the source code was presented, accompanied by two text boxes below it. They then provided the output and an English summary of the code, entering the output in the first text box and the summary in the second. There were no formatting requirements for the code summary. Participants had 40 minutes to complete the study. During the study, participants had internet access and could use any internet resource (e.g., Stack Overflow) in both the VR and desktop setups.

After the study, the participants completed a post-experiment survey capturing their experiences during the code comprehension study. This survey captured how mentally and physically demanding the task was, along with the other factors contributing to the overall task load. We also surveyed participants about their experiences using VR.

4.3 Surveys

We conducted three surveys. We administered (1) a demographics survey that asked questions about programming experience, including years of programming and Java experience. At the end of the study, we asked all participants to fill out (2) the NASA TLX survey and (3) a post-experiment survey.

The NASA TLX survey is a multi-dimensional questionnaire designed to determine the task load of a participant performing a task (Hart and Staveland 1988; Hart 2006). Various research communities have adopted it as a reliable measure of task load (Afridi and Mengash 2020; Cecil et al. 2021). Said et al. demonstrated that a modified NASA TLX survey accurately reflects the assumed influences of various covariates on perceived workload (Said et al. 2020). The survey measures mental demand, physical demand, temporal demand, performance, effort, and frustration via the questions listed in Fig. 2. Participants record their answers on a 21-point gradient scale. These gradient scales are converted into standard quantities using the official NASA TLX application after the experiment (TLX @ NASA Ames - NASA TLX App. NASA n.d.). The calculations provide an adjusted score for each category and an overall task load index. Figure 3 shows sample measures obtained through the NASA TLX survey. All surveys and anonymized data are available in our online appendix (see Section 4.10).
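As a concrete illustration of how the adjusted scores and the overall index relate, the sketch below implements the standard weighted NASA TLX formula (Hart and Staveland 1988): each subscale rating is multiplied by the weight it earned in the 15 pairwise comparisons, and the weighted sum is divided by 15. This reproduces the textbook calculation, not the internals of the official application.

    // Standard weighted NASA TLX calculation (Hart and Staveland 1988);
    // a sketch of the textbook formula, not the official application.
    public class NasaTlx {
        // ratings: six subscale ratings (mental, physical, temporal,
        //          performance, effort, frustration), each 0-100 in steps
        //          of 5 (the 21-point gradient scale).
        // weights: times each subscale was chosen across the 15 pairwise
        //          comparisons; each 0-5, summing to 15.
        public static double overallTaskLoad(double[] ratings, int[] weights) {
            double weightedSum = 0.0;
            for (int i = 0; i < 6; i++) {
                double adjusted = ratings[i] * weights[i]; // adjusted score per subscale
                weightedSum += adjusted;
            }
            return weightedSum / 15.0; // overall task load index, 0-100
        }
    }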

Fig. 2

A screenshot of the NASA TLX survey questions asked during the post-experiment survey. It lists all six questions and the 21-point gradient scale

Fig. 3

Sample adjusted scores and overall task load score from the official NASA TLX application

4.4 Data Collection

We collected screen recordings of all participants, along with the function outputs and source code summaries. We collected the participants’ demographics and self-reported programming experience before the study. After the study, participants completed a questionnaire in which they provided a self-reported productivity score. Finally, participants reported their mental demand, physical demand, temporal demand, performance rating, effort, and frustration through the NASA TLX survey. We calculated the overall task load from the NASA TLX data. Participants also filled out the simulator sickness questionnaire and the IPQ presence questionnaire (Kennedy et al. 1993; Regenbrecht and Schubert 2002); since we could not establish a control group that also used VR, we decided not to use these responses for data analysis. After the study, the second author analyzed the output and code summary provided by the participants: the author first checked the correctness of the output and then reviewed the code summary if the output was incorrect. We determined that the source code was successfully comprehended if either the output or the summary was correct.

4.5 Subject Application

The subject applications consist of eight Java projects. We selected these projects because they contain methods and functions commonly used across programming paradigms. The selected methods return the median of an integer array, find maximum and minimum values, compute the factorial of a number, return employee information using encapsulation, remove duplicates from a list, return the length of a word, demonstrate polymorphism, and determine whether a string is a palindrome. The first two authors collaborated to collect the subject applications from coding platforms such as Leetcode and W3Schools (The world’s leading online programming learning platform n.d.; W3Schools free online web tutorials n.d.). The projects differ in size, domain, and architecture. In line with previous studies, and to accommodate easy viewing in VR, we limited the code snippets to 22-57 lines (Busjahn et al. 2015; Rodeghero et al. 2014b; Peitek et al. 2021), keeping six of the eight snippets under 30 lines of code. We provide the subject applications in our online appendix (see Section 4.10).
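For a sense of the size and style of these snippets, the example below is in the spirit of the palindrome task; it is written for illustration and is not one of the eight actual projects, which are available in the online appendix.

    // Illustrative only: a snippet in the style of the study's tasks,
    // not one of the eight actual projects.
    public class Palindrome {
        public static boolean isPalindrome(String s) {
            int left = 0, right = s.length() - 1;
            while (left < right) {
                if (s.charAt(left) != s.charAt(right)) {
                    return false; // mismatch found; not a palindrome
                }
                left++;
                right--;
            }
            return true;
        }

        public static void main(String[] args) {
            System.out.println(isPalindrome("racecar")); // prints: true
        }
    }

For a snippet like this, a participant would enter the printed output (“true”) in the first text box and a plain English summary (e.g., “checks whether a string reads the same forwards and backwards”) in the second.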

4.6 Participants

The participants in the program comprehension study were all computer science graduate students recruited through email. A total of 26 participants took part. We recruited all participants from a graduate-level software engineering class at Clemson University, and all were fluent in English. In the desktop group, 10 participants reported intermediate Java programming experience. In the virtual reality group, eight participants reported intermediate Java programming experience, and two reported expert-level Java programming experience.

The desktop group consisted of four female participants and nine male participants with an average professional software development experience of 2.15 years. Eleven participants reported previous VR experience, with the majority reporting less than 1 hour of VR experience. The average age of this group was 24.76 years. The VR group had one female participant and 12 male participants with an average professional software development experience of 2.07 years. Eight of them reported previous VR experience, with the majority reporting less than five hours of VR experience. The average age of this group was 24.92 years. This data is provided in Table 1. We did not split up the participants based on their skills or experience.

Table 1 Participants’ demographic information

4.7 Statistical Tests

The survey data obtained in this study are ordinal in nature: the NASA TLX survey measures scores on a 21-point ordinal scale. We recorded measured comprehension as the number of code snippets correctly summarized by each participant during the study. During the post-experiment survey, we recorded the perceived productivity, perceived understanding, and perceived complexity of the source code. The perceived productivity question provides ratings on a five-point ordinal scale (1 = no productivity; 5 = extremely productive). Participants reported their perceived understanding of the source code on a three-point ordinal scale (1 = definitely not; 3 = definitely yes). We recorded the perceived complexity on a three-point ordinal scale (1 = no; 3 = yes). We used the Mann-Whitney U test to assess these data for differences between the VR and desktop groups (Hollander et al. 2013). This nonparametric test compares ranks between groups and is appropriate for non-numeric data. Continuity correction was applied to the U test statistic to enable computation of an approximate p-value in the presence of tied ranks. Performing multiple tests on subsets of data obtained from the same observations can result in the multiple comparisons problem: a test’s chance of returning false significance increases as we make multiple inferences from the same observations. To avoid this problem, we adjusted the calculated p-values using the Benjamini-Hochberg method (Benjamini and Hochberg 1995; Yekutieli and Benjamini 1999). We used Cohen’s d to calculate the effect size for the data collected using the NASA TLX survey (Cohen 2013); the effect size quantifies the difference between the data collected from the two groups. Finally, we performed a Spearman correlation test between the combined measured performance and the overall task load index of both groups (Binet 1904). This test helps us examine whether measured performance is statistically correlated with the overall task load during code comprehension.
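To make the adjustment and effect-size steps concrete, the sketch below implements the Benjamini-Hochberg step-up procedure and Cohen’s d with a pooled standard deviation over plain arrays. We performed the actual analysis with standard statistical software; this is only an illustration of the two formulas.

    import java.util.Arrays;

    // Sketch of two analysis steps described above; the study itself used
    // standard statistical software rather than hand-rolled code.
    public class StudyStats {

        // Benjamini-Hochberg step-up procedure: returns adjusted p-values.
        public static double[] benjaminiHochberg(double[] p) {
            int m = p.length;
            Integer[] order = new Integer[m];
            for (int i = 0; i < m; i++) order[i] = i;
            Arrays.sort(order, (a, b) -> Double.compare(p[a], p[b])); // ascending p
            double[] adjusted = new double[m];
            double running = 1.0;
            for (int rank = m; rank >= 1; rank--) {  // walk from the largest p down
                int idx = order[rank - 1];
                running = Math.min(running, p[idx] * m / rank); // enforces monotonicity
                adjusted[idx] = running;
            }
            return adjusted;
        }

        // Cohen's d between two groups, using the pooled standard deviation.
        public static double cohensD(double[] a, double[] b) {
            double meanA = mean(a), meanB = mean(b);
            double pooledVar = ((a.length - 1) * variance(a, meanA)
                              + (b.length - 1) * variance(b, meanB))
                              / (a.length + b.length - 2);
            return (meanA - meanB) / Math.sqrt(pooledVar);
        }

        private static double mean(double[] x) {
            double sum = 0;
            for (double v : x) sum += v;
            return sum / x.length;
        }

        private static double variance(double[] x, double mean) {
            double ss = 0;
            for (double v : x) ss += (v - mean) * (v - mean);
            return ss / (x.length - 1); // sample variance
        }
    }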

4.8 Equipment

We designed and developed the VirtualDesk system to conduct this experiment. The VirtualDesk is a hardware and software system that allows users to interact with a computer using a VR headset. We used HTC Vive Pro headsets (Vive Pro 2 - the best VR headset in the metaverse: United States n.d.). The headset had a Leap Motion controller attached to the front panel for hand tracking (Ultraleap: Tracking: Leap Motion controller n.d.). Two HTC Vive trackers were used to create the virtual keyboard and mouse (HTC Vive Tracker: Vive United States n.d.). One tracker was attached to a physical keyboard, and another was attached to the mouse. The virtual keyboard and mouse replicated the physical keyboard and mouse feedback when the user typed on the keyboard or clicked the mouse. The virtual keyboard provided this feedback by highlighting the key pressed and visually moving the key down and then back up. The virtual mouse would momentarily highlight the mouse button that the user pressed. Figure 4a shows a person using the VirtualDesk system with the HTC Vive Pro headset, Leap Motion controller, and the Vive trackers attached to the physical keyboard and mouse. Figure 4b shows the virtual keyboard and mouse where the ‘z’ button is pressed on the keyboard, shown in blue highlight, and the left mouse button is pressed, shown in white highlight. We used the VirtualDesk system in two previous studies to investigate programmer collaboration and code comprehension in VR (Dominic et al. 2020a; b). For more information on the equipment and setup, see the replication package in Section 4.10.

Fig. 4

Programmer using the VR setup. The virtual keyboard and mouse are shown in the second picture. a) A programmer performing code comprehension in virtual reality. b) Virtual keyboard and mouse used in the VR environment

Fig. 5

NASA TLX scores reported by participants while comprehending source code in VR and desktop

Fig. 6

Distribution of perceived productivity ratings by group (1 = no productivity, 5 = extremely productive)

Fig. 7

Weak negative correlation exhibited between Measured Comprehension and Overall Task Load

4.9 Threats to Validity

As with any evaluation, our study carries threats to validity. First, participants are susceptible to various symptoms of simulator sickness while using a VR headset, including nausea, headache, fatigue, and loss of balance. Factors like fatigue, stress, and error are all part of being human. We minimized this threat by allowing participants to stay stationary and not requiring them to move in VR. The virtual environment the participants were in served as a stable horizon, which reduces simulator sickness. We allowed participants to stop any time they experienced simulator sickness symptoms or felt uncomfortable; no one stopped midway through the experiment. Second, differences in programming experience, VR experience, and personal biases can affect our study. We cannot rule out the possibility that eliminating these factors would change our results. However, we minimized this threat by recruiting 26 computer science graduate-level programmers rather than relying on a single graduate student or recruiting undergraduate students. These participants had an average of 2.11 years of professional programming experience. Due to COVID-19, we were unable to recruit additional professional programmers for this study. Another potential threat is the source code we selected: had we chosen a different set of source code, different programming paradigms, or a different programming language, our results might have been different. To mitigate this threat, we used source code that covered many different programming paradigms, object-oriented best practices, and common functions seen in programming. Our code snippets range in size from 22 to 57 lines; thus, we cannot claim that our results generalize to source code of arbitrary size. Finally, there is a threat that the technology used in this study might not generalize to other VR technologies. We used state-of-the-art VR headsets and hand tracking equipment available at the time of the study. We have also made the software and all the data we collected available (see Section 4.10), allowing future researchers to conduct similar studies and build upon the system we developed.

4.10 Reproducibility & Online Appendix

For reproducibility, we have made all data and software available via an online appendix: https://tinyurl.com/2p9au29a

5 Human Experience Results

In this section, we present our results for each research question, our rationale, and an interpretation of the answers.

5.1 RQ 1: Concentration

Participants in the VR group rated their overall task load significantly higher (p = 0.03, effect size = 1.20) than those in the desktop group. The participants in the VR group also found the source code to be more complex (p = 0.04, effect size = 1.10) compared to participants in the desktop group. We observed a large and significant difference in the overall task load and perceived complexity of source code between both groups. We present the adjusted p-values from the statistical analysis performed on the data collected during the post-experiment survey and the NASA TLX survey in Fig. 5 and Table 2.

  • Hn The difference in concentration between code comprehension in VR and desktop is statistically significant.

Existing research shows that software developers create more bugs under increased mental and physical stress (Furuyama et al. 1996; Kuo et al. 1998). We found that participants in the VR group experienced more overall task load than those in the desktop group. We argue that this increase in overall task load resulted in the increased perceived complexity of source code among participants in the VR group.

5.2 RQ 2: Mental Demand

We did not find statistically significant evidence that participants in the VR group experienced more mental demand (p = 0.06, effect size = 0.72) or temporal demand (p = 0.24, effect size = 0.50) than those in the desktop group. These results do not show any differences in the mental demand or temporal demand experienced by both groups. We present the statistical test results on data collected using the NASA TLX survey in Fig. 5 and Table 2.

  • Hn The difference in mental load between code comprehension in VR and desktop is not statistically significant.

Table 2 Benjamini-Hochberg adjusted p-values and Cohen’s d for self-reported NASA TLX scores, perceived complexity, perceived productivity, and measured productivity

While designing the virtual environment, we provided all the features that participants would have access to on a desktop computer, to make the interactions in VR as natural as possible. The participants in the VR group had access to the virtual keyboard and mouse and to the virtual desktop, which mirrored the screen of the computer they were working on. It is well known that the introduction of a new technology can distract users, temporarily reduce productivity, and have unintended consequences (Blok et al. 2009; Beland and Murphy 2016). We believe that keeping deviations from a familiar desktop workflow minimal while introducing a new technology helped keep mental demand and temporal demand similar between the groups.

5.3 RQ 3: Frustration

We found no statistically significant evidence that participants doing virtual comprehension were more frustrated (p = 0.12, effect size = 0.47) than participants doing desktop comprehension. Our results did not show any statistically significant difference in frustration experienced between the groups. We present the results from our statistical analysis of data collected using the NASA TLX survey in Fig. 5 and Table 2.

  • Hn The difference in frustration between code comprehension in VR and desktop is not statistically significant.

The VirtualDesk system required participants to use a VR headset and interact with the computer using the virtual keyboard and mouse. Unfortunately, the Leap Motion controller introduced jitter and drift artifacts while tracking the participant’s hands, resulting in sub-optimal hand tracking that made it more challenging for participants to interact with the computer. Even so, the results do not show any statistically significant difference in frustration between participants in either setting. It is important to note that the virtual keyboard and mouse were developed exclusively for the VirtualDesk system. However, we cannot rule out the possibility that participants with previous VR experience might have performed better with a shorter learning curve.

5.4 RQ 4: Physical Demand

We found statistically significant evidence that physical demand (p = 0.04, effect size = 0.94) and effort (p = 0.04, effect size = 0.76) were higher among participants in the VR group than those in the desktop group. Our results show a large and significant difference in the physical demand and a medium but significant difference in effort experienced by both groups. We present the results of the statistical analysis in Fig. 5 and Table 2.

  • Hn The difference in physical demand between code comprehension in VR and desktop is statistically significant.

Participants in the VR group found the headset heavy and the keyboard alignment subpar. Three participants from the VR group reported that the VR screen was blurry and hard to read. In the VirtualDesk system, we provided a stable horizon, allowed participants to remain seated, and avoided locomotion within the virtual environment to reduce simulator sickness. Nevertheless, four participants from the VR group reported eye strain or headache during code comprehension. It is possible that a lighter, newer VR headset with a higher display resolution could alleviate some of the difficulties the participants experienced.

5.5 RQ 5: Perceived productivity

We found no statistically significant evidence of any difference in perceived productivity (p = 0.16, effect size = 0.74) or the NASA TLX performance rating (p = 0.71, effect size = 0.17) between participants in the VR group and the desktop group. Furthermore, 11 out of 13 participants in the VR group answered “definitely yes” when asked, “Did you understand the source code?” during the post-experiment survey. Interestingly, 11 out of 13 participants in the desktop group also answered “definitely yes”; the remaining two participants in each group answered “somewhat.” Our results do not show any difference between the two groups’ perceived productivity or performance rating. We present the results of the statistical analysis performed on the post-experiment survey and the NASA TLX scores in Fig. 5 and Table 2, and the self-reported productivity score for each group in Fig. 6.

  • Hn The difference in perceived productivity between code comprehension in VR and desktop is not statistically significant.

Even though the use of VR for code comprehension is a novel approach, participants did not report significant differences in perceived productivity. Interestingly, there was no difference in the responses to the perceived understanding of source code between both groups. These observations reflect the performance ratings obtained using the NASA TLX survey, where we did not observe any statistically significant differences between the groups.

5.6 RQ 6: Measured Comprehension

We found that participants in the desktop group correctly comprehended a greater percentage of code snippets than those in the VR group. Participants in the desktop group submitted 93 summaries, of which 70 (75%) were correct. Participants in the VR group submitted 84 summaries, of which 55 (65%) were correct. On average, each participant in the desktop group correctly comprehended 5.38 snippets, whereas participants in the VR group correctly comprehended an average of only 4.23. However, this difference in measured comprehension was not statistically significant (p = 0.328).

  • Hn The difference in measured productivity between code comprehension in VR and desktop is not statistically significant.

Figure 7 shows the correlation between measured comprehension and overall task load. The Spearman correlation test revealed a weak negative correlation (r = -0.357) between both groups’ combined measured comprehension and overall task load index, meaning that measured comprehension decreases as the overall task load increases. However, this correlation was not statistically significant (p = 0.072). Since the p-value is close to the 0.05 threshold, we cannot rule out the possibility that a larger sample size would reveal a stronger correlation between overall task load and measured comprehension. These findings lead us to believe that there is no statistically significant difference in measured comprehension between the desktop and VR groups.

5.7 Summary of human experience results

We derive a few key observations from the study results. Participants experienced more challenges during code comprehension in VR than on a desktop. We base this observation on the statistically significant increases in physical demand, effort, and perceived complexity reported by participants in the VR group during code comprehension. Participants in the VR group also reported a higher overall task load, reflected in the overall task load scores. Participants in VR completed fewer code comprehension tasks and comprehended fewer of them correctly.

Even though code comprehension in VR placed a higher task load on participants, we did not find statistically significant evidence suggesting that code comprehension in VR is less beneficial than code comprehension on a desktop computer. We found no statistically significant difference in perceived productivity, performance rating, or perceived understanding of the source code between the two groups, nor in the actual number of source code snippets correctly comprehended. Our future work includes addressing the hardware and software limitations that led to a higher task load during code comprehension in VR, which could yield stronger results supporting the use of virtual environments for code comprehension in future studies.

6 Virtual Reality Study Lessons Learned

This section discusses our experiences and the lessons learned while conducting the code comprehension study in VR. We provided as little assistance as possible to the participants during the study. Unfortunately, once they were in the virtual environment, they were often unsure of the study’s instructions, requiring us to repeat them. We recommend letting participants get used to the VR headset and the virtual environment before providing the study instructions. We enlarged the Windows desktop and applications to compensate for the VR headset’s limited resolution. As a result, the participants could only see about 20 lines of code at any given time and needed to scroll the Visual Studio Code window to see all of the source code. The magnified view also made it difficult for many participants to use the Visual Studio Code interface in VR. The participants spent much of their time scrolling to view all of the code rather than documenting the code summary.

Other lessons learned concern the limitations of the VR system and the computer vision technologies used. The Leap Motion hand tracking system was not accurate enough to properly align the participant’s fingers with the virtual keyboard for typing in VR. Although each participant had the virtual keyboard aligned to their fingers before the study began, it went out of alignment if the participant made significant adjustments to the headset, which slowed down many participants while documenting code summaries. Many participants could not touch type, causing them to make many errors during the summary documentation process, which slowed them down further. We provided a keyboard alignment tool, accessed through a space bar and “shift” key press, to realign the keyboard as necessary. During the study, a few participants in both groups did not document the output for the given source code; however, all participants documented the source code summary. Hence, we decided to consider both the output and the summary when determining the correctness of code comprehension.

7 Discussion

Our paper advances the state-of-the-art in two directions. We contribute to program comprehension literature by adding to the research investigating how programmers comprehend source code in VR. We do this by contributing empirical evidence of programmer behavior during source code summarization tasks in VR. We used the tool presented in Section 4 during these studies.

We recorded the source code summaries of 26 graduate-level software engineering students (13 in VR and 13 using a desktop). They evaluated 8 different Java projects, each covering different domains, paradigms, and functionality. We released all of the survey data, summaries, evaluations, and source code via our online appendix (see Section 4.10) to promote independent research. At the same time, we analyzed the data and found quantitative evidence that source code comprehension in VR is possible. We did not find any statistically significant differences in the measured productivity or the perceived productivity of the programmers between the VR and desktop groups. Programmers in the VR group did not report statistically significantly higher mental demand or frustration than the desktop group. One of the participants in the VR group was excited about using VR for code comprehension and the prospect of using similar systems in the future.

“The experiment was quite fun to perform, and could open up new ways to how everyday activities such as programming can be performed.” (P318)

However, programmers experienced various challenges while performing code comprehension in VR. Participants reported higher levels of physical demand and effort in VR. We believe this was due to the difficulty of using the virtual keyboard and reading text in VR. We used an HTC Vive tracker to track the physical keyboard and a Leap Motion controller for hand tracking. Unfortunately, this combination introduced jitter and drift artifacts, making typing very difficult. Newer VR headsets are lighter and come with built-in hand tracking, significantly reducing the jitter and drift otherwise introduced by combining two separate systems such as the HTC Vive Pro and Leap Motion. Participants in the VR group commented on using the virtual keyboard during code comprehension.

“I found it challenging to touch my way around they keyboard when I had the headset on compared to how I normally use the keyboard.” (P304)

“The keyboard alignment needs some adjustment, which can take a while and can be frustrating as you are backspacing a lot.” (P300)

The HTC Vive headset has a resolution of 2160×1200 pixels. Even though this is not a small pixel count in most circumstances, its text rendering capabilities are limited when spread across a 120-degree horizontal field of view. VR headsets such as the Pimax Vision 8K X offer a 3840×2160 resolution per eye, which might provide a more pleasing experience when reading text in VR (Pimax Vision 8K X 2022). We tried to mitigate this issue by enlarging the text and the Windows interface, which made the text more readable. Even so, in the post-experiment survey, seven participants in the VR group mentioned having trouble reading text in VR. Below we list the responses from two participants in the VR group.

“The hardest part was the loss of vision if that makes sense. It was less clear, had to focus more, maybe there’s a setting to make it more crisp.” (P307)

“After the first eight minutes or so my eyes started getting really irritated and burned for whatever reason. Made it really uncomfortable and I had to kinda close my eyes for a bit.” (P321)

We contribute to measuring the human experience during code comprehension by using the NASA TLX survey. Our study, coupled with the NASA TLX results, produced statistical results for the overall task load, mental demand, physical demand, temporal demand, performance rating, frustration, perceived complexity, perceived productivity, and measured comprehension. Our evaluation found that programmers experience more physical demand and a higher overall task load in VR than in desktop code comprehension. Programmers also reported statistically significantly higher perceived complexity and effort in VR. However, we did not observe any statistically significant differences in mental demand, temporal demand, performance rating, frustration, perceived productivity, or measured productivity between the two groups of participants. We provide the results for each of these variables in Table 2.

While conducting the experiments, we provided a very brief pre-training session to the participants using VR. The pre-training session involved finding a comfortable position for the VR headset on their face and introducing them to the keyboard alignment software. The participants aligned the virtual keyboard to their hands and started code comprehension. A participant in the VR group said the following about the lack of pre-training.

“The vision seemed hard while concentrating on the screen. Could have performed better with more exposure to the VR environment and less stress on eyes.” (P314)

We cannot rule out the possibility that programmers with more pre-training in VR or more programming experience in virtual environments could perform better than those in our VR group.

8 Recommendations

Our paper offers a framework for future research in remote collaboration. As more programmers move towards remote work, a remote virtual environment can provide an office environment for programmers. Furthermore, VR opens many possibilities for remote collaboration, of which there are already various examples. The VirtualDesk environment already supports multiple users in the same virtual space (Dominic et al. 2020b). We found that programmers collaborated more efficiently using VirtualDesk than using video conferencing tools. Based on our observations and results, we recommend the following:

Pre-training in VR

We recommend that future researchers using VirtualDesk and similar VR systems for software engineering provide ample pre-training in the environment before starting the experiment. Some programmers wanted us to repeat the study instructions once they were in VR. We believe that a pre-training session will help programmers become familiar with the virtual environment and understand the instructions more clearly. During our study, the programmers could use the interface as intended once we repeated the instructions while they were in VR.

Short Sessions in VR

We recommend that programmers use VR headsets for short periods of time rather than spending all day working in a VR headset. A short VR session can be highly productive, as programmers can focus on the task without external distractions. Unfortunately, using a heavy VR headset and other necessary equipment can be fatiguing over longer sessions. As VR headsets become lighter and untethered and gain integrated hand tracking, programmers will be able to spend more time in VR without fatigue.

VR for Software Development Team Connection

We recommend using VR for social connection within software development teams. As VR is used as a social connection tool in many other domains, we believe there may be advantages for software development teams in building and maintaining social connection while working in a hybrid or remote model. We believe that further exploration is required in this area.

In addition, we will continue working in this area and hope to see others extend this work. One extension we have started working on is adding a whiteboard feature in VR, which allows programmers to take notes and discuss them with someone else in or outside of VR. In this paper’s work, we found that using VR increased the task load on programmers. We believe that adding a whiteboard will provide a more natural way of taking notes during source code comprehension tasks in VR.

9 Limitations

Limitations of our study include problems with the VR hardware used by programmers. The limited accuracy of the Leap Motion controller introduced jitter and drift in the hand tracking used in this study. To mitigate this issue, we developed a calibration system that lets the user realign the virtual keyboard. Programmers could activate this function with the space bar and “shift” key combination and then use the arrow keys, which are easy to locate by touch.

Another limitation of our study is that we used students rather than professional programmers. We recruited computer science graduate students to participate in the study. Programmers across the VR and desktop groups reported an average of 2.11 years of professional programming experience. Unfortunately, due to the COVID-19 pandemic, we could not recruit additional professional programmers to participate in the study.

10 Conclusion

In this paper, we presented a study exploring the use of VR for program comprehension by software engineers. We compared program comprehension in VR against program comprehension on a desktop. We explored six research questions to understand how VR affects programmers’ ability to comprehend source code. We showed that programmers working in VR face more difficulties comprehending code than those working on a desktop. However, we found no statistically significant evidence of a difference in measured or perceived productivity between comprehension in VR and comprehension in a desktop setup.