1 Introduction

User experience (Tullis and Albert 2013) goes beyond the traditional concept of usability by including emotional and aesthetic factors (Bachmann et al. 2018). As defined by ISO 9241-210 (ISO 2010), it encompasses “perceptions and responses that result from the use or anticipated use of a product, system or service.” User task performance concerns specific tasks such as navigation, object placement and object selection. An important factor influencing user performance is mental workload (Cain 2004), also known as cognitive workload, which represents the mental effort required to perform the tasks. It can be assessed by means of quantitative performance tests, physiological measures or subjective feedback collected through questionnaires (Moustafa et al. 2017).

A key aspect for the effectiveness of virtual environments is the sense of presence. It denotes the subjective experience of being in one place or environment, independently of where a subject is actually located (Witmer and Singer 1998). Another important aspect that characterizes the experience in virtual environments is user interaction. Touchless interfaces enable new forms of natural interaction, where well-defined positions or movements of parts of the human body, known as gestures (Garber 2013), are associated with meaningful commands. For instance, gestures allow doctors to consult medical data during interventions (De Paolis 2016). Moreover, in image-guided navigation systems (De Mauro et al. 2012) and augmented reality applications (De Paolis and De Luca 2019; De Paolis and Ricciardi 2018) for the intraoperative support of surgical procedures, touchless interaction with virtual organs (De Paolis 2018; Indraccolo and De Paolis 2017) is fully compliant with sanitary regulations. In military operations, it can increase situational awareness by providing a faster and more natural way to issue commands (Zocco et al. 2015). In addition to offering greater intuitiveness than traditional WIMP (Windows, Icons, Mouse, Pointer) interfaces, gesture-based interaction can increase motivation in the educational field (De Paolis et al. 2019) by fostering kinesthetic learning (Kosmas et al. 2018), where the cognitive process is enhanced through the physical involvement of the human body.

In a virtual environment, gesture-based interaction can improve immersion as long as no conscious attention is required for the specific gestures being performed (Deller et al. 2006). The concept of guessability (Wobbrock et al. 2005), a stronger requirement than immediate usability, represents the ability to guess symbolic input without any prior knowledge or learning phase. Unfortunately, the design of gesture detection systems is often driven more by ease of technology implementation than by naturalness of gestures: a typical example is the Myo armband, a wearable device that detects five basic gestures from the electromyographic impulses of the arm (Rawat et al. 2016).

This paper investigates the differences between handheld controller and gesture-based interaction in a virtual reality experience. In particular, it aims to evaluate the impact that abstract gestures based on electromyography (EMG) technology can have on usability and the sense of presence. The main obstacles are expected to derive either from the fact that gestures are abstract/symbolic (as they do not involve grasping or directly manipulating objects), or from detection accuracy problems intrinsic to the device. This will allow us to understand whether future research should focus more on improving detection accuracy, on expanding and diversifying the vocabulary of abstract gestures, or on introducing the detection of 'fine' movements, also involving the fingers, to aim at recognizing gestures that are less abstract and closer to manipulative ones.

The considered scenario is a virtual navigator for the HTC Vive headset (HTC 2021) that allows the user to explore the organs of the human body and navigate inside them. The study builds on earlier work (De Paolis and De Luca 2020), which compared the performance of the Myo armband with that of the Vive controller, a handheld device bundled with the HTC Vive headset. In that context, data on user impressions were collected through the Usability Metric for User eXperience (UMUX) (Finstad 2010), the System Usability Scale (SUS) (Brooke 1996) and the Presence Questionnaire (PQ) (Witmer and Singer 1998). While the previous work compared various factors of the user experience between the two devices, trying to identify any interdependence between several components (involvement, interface quality, adaptation/immersion, visual fidelity), this work digs deeper into the possible causes of the differences in user experience, taking into account the peculiarities of the two interfaces.

The rest of the paper is structured as follows: Sect. 2 presents the related work about presence and interaction in virtual environments; Sect. 3 introduces the main factors of user experience in a virtual environment; Sect. 4 briefly summarizes the target experimental scenario, which was introduced and described in more detail in our previous work (De Paolis and De Luca 2020); Sect. 5 presents the questionnaires chosen for our study, after a brief introduction to the most frequently used questionnaires for the evaluation of usability and user experience; Sects. 6, 7, 8 and 9 present the results; Sect. 10 discusses and summarizes the main findings; Sect. 11 concludes the paper.

2 Related work

Several studies in the literature have addressed presence in virtual environments. Mutual relations between presence and emotions were discussed in Diemer et al. (2015), with a special focus on the perception of fear. The technological, cognitive and emotional factors that increase the sense of presence were analyzed in Gorini et al. (2011), which considers a virtual hospital scenario. The effects of immersive VR on empathy were evaluated in terms of immersion-presence, illusion of body ownership, illusion of agency, engagement and mind-wandering (Barbot and Kaufman 2020): the results of the study highlighted the role of VR as a “perspective taking-machine”. Usability, presence and perceived workload were assessed in a virtual operating room (Li et al. 2020) recreated through an Oculus Rift VR headset and a LapMentor III laparoscopic simulator: the high mental demand and low frustration revealed by the experimental tests suggested that users tend to enjoy the intellectual challenges provided by the virtual environment. Presence, workload, usability and flow were evaluated in a study comparing a video display terminal and a head-mounted display in walking and driving scenarios (Rhiu et al. 2020): the former is better in terms of workload, while the latter provides a higher sense of presence and flow in direct interaction; on the other hand, in mediated interaction, such as driving a vehicle, a head-mounted display also has lower usability. An older study (Livatino et al. 2015) assessed the usability of a stereoscopic bronchoscope in terms of depth impression, presence and comfort. It compared seven systems (from laptops to wall screens and head-mounted displays), four approaches to stereo viewing (colored anaglyph, polarized filters, shutter glasses, and separated displays), and five types of display technologies (digital light processing, cathode ray tube, liquid crystal display, LED, and organic LED). Another comparative study between a head-mounted display and a desktop LCD screen in driving simulation focused on sickness, subjective eye symptoms, game engagement and game performance (Cao et al. 2020): it revealed a moderately higher simulator sickness for the head-mounted display. Other usability studies comparing various visualization systems for virtual environments can be found in Yu et al. (2019), Somrak et al. (2019), Webster and Dues (2017), Tcha-Tokey et al. (2017).

The influence of navigation control and screen size was assessed in Clemente et al. (2014) by recording brain activation with an EEG. Three different VR systems for firefighter training were compared in terms of usability, ergonomics and learning effectiveness (Corelli et al. 2020): in that context, locomotion management is a key aspect that influences the accuracy and timing of execution, due to the need to cover long distances and exert fine movement control. Teleportation and natural walking were studied as interaction modalities for abstract data exploration in a virtual environment enhanced with haptic feedback (Zenner et al. 2020): a comparison with a traditional 2D tablet interface revealed a tradeoff between efficiency and user interest, but no difference in model understandability.

In most virtual environments, an eye-centered interaction principle has been adopted, since the user's visual searching is typically considered more efficient and comfortable than hand interaction. The influence of arm movements was studied in a free-hand target selection experiment in a virtual environment (Lou et al. 2020), with a special focus on hand choice and hand position. In Pai et al. (2019), gaze tracking and forearm contractions, detected through electromyography, were associated with cursor movements and selections: this combined modality for user interaction proved to perform better than the HTC Vive controller, the Xbox gamepad, dwelling time and eye-gaze dwelling time.

Other experimental studies evaluated the impact of touchless and gesture-based interaction on the user experience.

The Leap Motion controller (Leap Motion 2021) and the Myo armband were compared with a standard game controller in a ball-balancing maze-like game scenario (Chen et al. 2018). Despite overall good performance, Leap Motion and Myo provide a lower level of control, but the gap with the traditional controller tends to narrow as users become more familiar with the devices. In particular, the understanding of Myo's physical dynamics, gained gradually through practice, gives better control than Leap Motion, whose lack of feedback makes it more difficult to move objects accurately. Users said they found interaction through devices such as Leap Motion and Myo more interesting than through traditional controllers, which they considered “aged.”

Another work (Wirth et al. 2018) also revealed that interaction through EMG devices is considered more interesting and exciting than interaction through handheld controllers. On the other hand, users involved in this experiment declared that the EMG experience was more stressful, as it required more thinking. Similarly, experimental tests on virtual manipulation based on Leap Motion highlighted some interaction fidelity issues in grabbing virtual objects, caused by users' difficulty in perceiving the position of their fingers (Gieser et al. 2016).

Two interaction modes for teleportation based on one hand and two others based on two hands were compared in Schäfer et al. (2021): the results revealed that single-hand gestures generally provide a more comfortable and effective way to move in a virtual environment, whereas two-handed gestures require a higher workload.

The experiments with the voice and motion gesture controller presented in Niño et al. (2019), which also provides tactile feedback, proved that interaction has a significant influence on the sense of presence, embodiment and immersion.

The effects of gesture-based interaction in a computer-based science lesson were studied in Bailey et al. (2019): experimental tests proved that gestures perceived as natural contribute to increasing not only the sense of control, but also the feeling of presence; on the contrary, gestures perceived as unnatural and not very usable can generate interferences that may distract the user from a task.

3 The factors of user experience in a virtual environment

The following subsections define the main factors characterizing the user experience in a virtual environment that outline the framework of this study: presence, immersion and interaction.

3.1 Presence

Presence refers to the sensation of being in a place other than the actual physical location, achieved by deceiving the user’s cognitive and perceptual systems (Slater et al. 1994).

Riva et al. (2004) identified three layers of presence:

  • proto presence, based on kinesthetic information about the relative position of the user’s body;

  • core presence, representing the process of selective attention applied to perceptions;

  • extended presence, which enhances the significance of external events and reinforces the presence of the subject in significant experiences.

According to Slater (2009), two perceptual illusions simultaneously contribute to the sense of presence: the place illusion of being really in a place and the plausibility illusion that events are really happening.

Place illusion is not closely related to perceptual realism (Sanchez-Vives and Slater 2005), which indicates the extent to which virtual representations appear similar to the corresponding real objects. Plausibility illusion, on the other hand, shares some aspects with social realism (Nilsson et al. 2016), which is concerned with the level of real-life adherence of a multimedia representation.

Even though place illusion is a subjective experience, it partially depends on system immersion, which can be described in terms of objective properties (Slater 2003) (such as frame rate, fidelity of tracking and field of view) and measured in quantitative terms. An immersive system can be described through its sensorimotor contingencies (Slater 2009), which are the actions performed to explore the environment and inspect the objects in it. On the contrary, the sense of presence is not an objectively measurable quantity: it can vary between different subjects according to the different actions they perform, as one individual, for example, may explore the environment more widely than another.

Plausibility illusion requires that (Rovira et al. 2009):

  • each action performed by the user corresponds to a specific reaction in the virtual environment;

  • the environment always responds directly to the user;

  • the events within the environment correspond to the user’s expectations deriving from everyday life.

Waterworth and Waterworth (2001) described presence as “a conscious emphasis on direct perception of currently present stimuli rather than on conceptual processing.” In particular, they identified three dimensions in the allocation of attention in a virtual environment:

  • focus of attention is directly related to the sense of presence, which increases when a low degree of conceptual and abstract reasoning is required; a technologically immersive system plays an important role in this dimension insofar as it provides concrete information that can be processed directly by perceptual-motor systems;

  • locus of attention deals with the attention paid to the virtual environment rather than to the external environment;

  • sensus of attention represents the degree of conscious arousal, i.e., the basic psychological response to external stimuli.

3.2 Immersion

According to Agrawal et al. (2020), immersion is a state of deep mental involvement characterized by a shift of attention away from the awareness of the physical world, and it can be described through three main factors. The first one is the subjective sense of being surrounded, known as perceptual immersion. Another factor is the absorption in the narrative, which causes a shift of attention in the player: it consists of spatial immersion, which concerns the sense of space and the act of exploration; temporal immersion, which derives from the curiosity to know how a story evolves; and emotional immersion, which refers to the emotional attachment to a story character (Ryan et al. 2003). The third factor is made up of strategic immersion and tactical immersion: the former derives from the absorption when a player is making choices, while the latter derives from the attention on tasks requiring quick reactions. In contrast to the concept of immersion based on strategic or tactical challenges, systematic immersion (Arsenault 2005), which is more suitable for non-participatory activities, simply focuses on the acceptance of the game's system instead of real-world laws of physics.

In the various papers presented so far, immersion has been described sometimes as the ability of technology to faithfully reproduce real-world perception and actions, known as system immersion, and other times as a subjective experience induced by specific devices (Nilsson et al. 2016).

According to Witmer and Singer (1998), who designed the Presence Questionnaire employed in this study, the psychological state deriving from directing attention to a stimulus is called involvement, while immersion is considered as the subjective experience of being surrounded in an interactive environment. According to Witmer’s view, the main factors influencing immersion are:

  • the level of isolation from the physical environment;

  • the sense of self-inclusion in the virtual environment;

  • egocentric motion perception and the natural interaction with the environment.

A concept close to immersion is that of flow, defined as “the state in which people are so involved in an activity that nothing else seems to matter” (Csikszentmihalyi 1991). It was described through the following eight components: balance between ability and challenge, concentration/attention, clear goals, immediate feedback, escape from everyday life, sense of personal control, loss of self-consciousness and altered sense of time. Passive activities, in which some of these factors (balance between ability and challenge, clear goals, immediate feedback) are absent, cannot be considered as flow experiences, although they may induce immersion.

3.3 Interaction

An important concept underlying the interaction in a virtual environment is discoverability (Norman 2013), which refers to the possibility of exploring how the environment works and what operations are possible. Norman identified three phases in interaction (Norman 2013): forming the goal, executing the action and evaluating the results. Execution consists of planning the action, specifying the action sequence and performing it. Evaluation consists of perceiving the state of the world, interpreting it and comparing it with the original goal. The user is only vaguely aware during the execution and evaluation phases. Only the appearance of something new or an obstacle awakens conscious attention. In particular, the processes of specifying actions and interpreting perception occur in a semiconscious manner; on the contrary, the processes of performing actions and perceiving are usually automatic and subconscious unless some action requires special attention. Gulf of execution refers to the situation where it is unclear how to achieve an objective that is known, while gulf of evaluation refers to not understanding the results of an action (Jerald 2015).

The degree to which physical actions required for a task in a virtual environment correspond to the physical actions required by the equivalent real-world task is known as interaction fidelity (Bowman et al. 2012). Realistic interactions avoid adaptation problems, which may have a negative impact on the effectiveness of a training procedure. However, even non-realistic magic interactions, based on metaphors that enable gestural interaction with distant objects, can improve the user experience by overcoming the limitations of the real world. In both realistic and non-realistic interactions, the intuitiveness of the interface is crucial. Interaction metaphors (Jerald 2015) are interaction modalities that refer to specific knowledge deriving from other domains. An ideal virtual environment should be consistent with the conceptual model in the mind of each user, without any explanation.

In general, two key interaction fidelity factors that are crucial to achieve good performance and avoid user frustration are input veracity and control symmetry (McMahan et al. 2015): the former represents the ability of an input device to capture user actions in terms of accuracy, precision, and latency, while the latter is the level of control in the interaction compared to the equivalent real-world task.

3.3.1 Touchless interaction with virtual objects

Natural User Interfaces (NUIs) were introduced to allow users to interact with a computer system through actions similar to those performed in the real world. They include touchless interfaces, which allow commands to be given in the form of hand gestures. In virtual environments, touchless interfaces also offer the possibility to move or manipulate virtual objects by means of gestures or by simulating the act of grasping them.

Various criteria for gesture classification were discussed in Li et al. (2019):

  • from the point of view of spatiotemporal status, gestures can be static or dynamic: the former are static poses at a given instant of time that do not include information about time series, while the latter represent changes in poses over a certain period of time;

  • from a semantic point of view, manipulative gestures allow moving or rotating objects through arm or hand movements, whereas communicative gestures have a specific information function;

  • from the point of view of the scope of interaction, stroke gestures involve moving the hand on a support surface such as a touchscreen, whereas mid-air gestures involve movements in free space without any support surface. The former exhibit a better ability to capture the details of actions, while the latter allow a larger volume of interaction.

Another systematic review (Vuletic et al. 2019) classified hand gestures into the following categories:

  • deictic gestures, i.e., pointing gestures used to indicate a direction or a point selection or to move an object along a path created by the hand;

  • pantomimic gestures, which emulate the real-life actions performed to pick up, pull and modify parts of an object;

  • free-form gestures, used to move virtual objects, windows or pointers;

  • manipulative gestures, associated with translation, rotation, scaling/zooming or object size manipulation;

  • semaphoric gestures, which consist of abstract, predefined hand motions representing concepts.

Other studies analyzed which gestures are preferred by users (Chen et al. 2015) and perceived as natural (Grandhi et al. 2011).

Visual affordances play a critical role in several gestures operating on virtual objects (Kang et al. 2020): they concern the direct perception of the potential actions that can be performed with the objects, without the need for high-level processes such as reasoning about object properties (Thill et al. 2013). Electroencephalographic (EEG) signals, gathered through dedicated headsets (Invitto et al. 2015, 2016), were analyzed to study the activation of brain areas for affordances during gesture-based tasks.

By contrast, the abstract gestures detected by the Myo armband, which are the subject of this study (described in Sect. 4), fall into the category of semaphoric gestures and are not related to the concept of grasping an object. Myo is based on electromyography (EMG), which makes it possible to recognize gestures by detecting the user's motion from changes in muscle physiological signals: EMG-based devices detect changes in electric current by means of electrodes placed on the skin surface (Kim et al. 2008).

Several studies have been carried out to find the best ways to arrange sensors on the user's arm for gesture recognition (Saponas et al. 2008, 2009; Benko et al. 2009; Saponas et al. 2010), and several methods (Jo and Oh 2020; Lu et al. 2020; Benalcazar et al. 2018; Wan et al. 2017; Kratz et al. 2012; Huang et al. 2015; Dai et al. 2021) have been proposed to improve detection and classification accuracy. Myo-based gesture classification was compared in terms of accuracy with that based on a more expensive EMG system providing a higher sampling rate (Gieser et al. 2017): the experimental tests showed that similar results can be achieved for the two devices through KNN classification; moreover, feature extraction gives better classification results than the use of raw data. Other studies (Dolopikos et al. 2021) showed that a model calibration process can improve gesture classification accuracy by 24%.
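
To make the feature-based KNN pipeline mentioned above concrete, the following R sketch classifies simulated EMG windows from two common time-domain features (mean absolute value and root mean square); all data, channel counts and gesture labels are hypothetical placeholders, not taken from Gieser et al. (2017) or from this study:

    # Sketch: KNN gesture classification on extracted EMG features.
    library(class)  # provides knn()
    set.seed(42)
    n_win <- 200  # simulated windows: 8 channels x 50 samples, flattened
    emg_windows <- matrix(rnorm(n_win * 400), nrow = n_win)
    labels <- factor(sample(c("fist", "spread", "wave_in", "wave_out"),
                            n_win, replace = TRUE))
    # Per-channel time-domain features: mean absolute value (MAV) and RMS
    extract_features <- function(w) {
      channels <- split(w, rep(1:8, each = 50))
      unlist(lapply(channels, function(ch)
        c(mav = mean(abs(ch)), rms = sqrt(mean(ch^2)))))
    }
    features <- t(apply(emg_windows, 1, extract_features))
    # Classify held-out windows on features rather than raw samples
    train_idx <- sample(n_win, 150)
    pred <- knn(train = features[train_idx, ], test = features[-train_idx, ],
                cl = labels[train_idx], k = 5)
    mean(pred == labels[-train_idx])  # held-out classification accuracy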

4 The considered scenario

The considered application, developed for the HTC Vive headset (HTC 2021) in Unity (Unity Technologies 2021) with the SteamVR API (Steam 2021), is a virtual navigator of the human body: it provides a combined visualization where 3D models of organs are intersected by the plane of the CT images used to reconstruct them.

A total of 84 subjects were recruited to test the virtual reality application: half of them used the Vive controller, while the other half used the Myo armband.

As explained in our previous work (De Paolis and De Luca 2020), commands for zooming/rotating the 3D models and for translating the CT slices were associated with buttons on the Vive controller and gestures detected by the Myo armband (Fig. 1). A preliminary calibration process was necessary for each user wearing the Myo armband to adjust the device to the forearm’s morphology and improve gesture detection.

Fig. 1 Myo gestures for activation, zoom, rotation and CT slice control

After the virtual reality experience, each user was asked to fill in some questionnaires, as described in the following section.

5 Evaluation of usability and user experience for Vive and Myo interaction modalities

Several questionnaires have been proposed to measure the perceived usability of a system (Assila et al. 2016; Lewis 2018; Hajesmaeel-Gohari and Bahaadinbeigy 2021).

The Software Usability Measurement Inventory (SUMI) (Kirakowski and Corbett 1993) is a long and complex questionnaire designed to assess “software quality from the end user’s point of view.”

The System Usability Scale (SUS) (Brooke 1996) is a simpler questionnaire, made up of 10 items forming two subscales (Lewis and Sauro 2009): learnability, which refers to the ability to quickly and independently learn how to use a system, and usability in a stricter sense.
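
As an illustration of how SUS item scores and the two subscales can be derived, here is a minimal R sketch assuming the standard SUS scoring (odd items contribute the chosen value minus 1, even items contribute 5 minus the chosen value) and the Lewis and Sauro (2009) decomposition, with items 4 and 10 forming learnability; aggregating subscales by item-score means is one possible choice, not necessarily the one adopted in this study:

    # Sketch: SUS scoring and usability/learnability subscales.
    # 'responses' is a hypothetical 84 x 10 matrix of raw answers (1-5 scale).
    score_sus <- function(responses) {
      odd  <- c(1, 3, 5, 7, 9)   # positively worded items: value - 1
      even <- c(2, 4, 6, 8, 10)  # negatively worded items: 5 - value
      scores <- responses
      scores[, odd]  <- responses[, odd] - 1
      scores[, even] <- 5 - responses[, even]
      list(usability    = rowMeans(scores[, setdiff(1:10, c(4, 10))]),
           learnability = rowMeans(scores[, c(4, 10)]),
           overall      = rowSums(scores) * 2.5)  # classic 0-100 SUS score
    }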

The more recent Usability Metric for User Experience (UMUX) questionnaire (Finstad 2010), made up of 4 items, was designed to make usability evaluation compliant with ISO 9241-11 (International Organization For Standardization 1998; ISO 2018) by including effectiveness, efficiency and satisfaction. Users are requested to express their opinion on each of the 4 items according to a 7-point Likert scale from “Strongly Disagree” to “Strongly Agree.”

Alongside the classic concept of usability, user experience was introduced to cover additional factors such as usefulness, emotional factors and design elegance (Vosinakis and Koutsabasis 2018). Although questionnaires such as SUMI and UMUX took a first step toward user experience by including a user satisfaction component, other questionnaires were designed specifically for user experience evaluation. A first example was the User Experience Questionnaire (UEQ) (Laugwitz et al. 2008), made up of 26 items grouped into 6 scales (attractiveness, perspicuity, efficiency, dependability, stimulation and novelty). The more recent User eXperience Context Scale (UXCS) (Lallemand and Koenig 2020), made up of 30 items, focuses on the evaluation of objective and perceived contextual dimensions in user experience.

In this study, the UMUX and SUS questionnaires were employed to evaluate the perceived usability, while the Presence Questionnaire (PQ) (Witmer and Singer 1998) was used to assess the user experience in the virtual environment. For each questionnaire, a comparative analysis was carried out among three datasets:

  • the Vive dataset, made up of the item scores collected for the 42 subjects using the Vive controller;

  • the Myo dataset, made up of the item scores collected for the 42 subjects using the Myo armband;

  • the All dataset, made up of the item scores collected for all the 84 subjects involved in the test.

The difference between this work and the previous one, outlined in the introductory section, lies in the type of approach adopted in the analysis of the data collected through the three questionnaires. Whereas the previous work employed Principal Component Analysis to identify interdependencies between the factors into which the items of the questionnaires are grouped, this new analysis addresses the variability of the answers given to each individual item. The first part of the analysis considers the overall scores of the questionnaires in Sect. 6. Then, the differences between the factors/components and the effects on the individual items are evaluated in the subsequent sections.

Besides the mean values, we computed the coefficient of variation, i.e., the ratio of the scores' standard deviation to their mean value, to evaluate the level of dispersion, which indicates how much users' opinions differ. A significantly higher coefficient of variation for an interaction modality suggests that users' impressions about a questionnaire item diverge more. Moreover, a high coefficient of variation in the All dataset combined with low coefficients of variation in the Vive and Myo datasets would suggest a clear influence of the interaction modality on users' opinions, which in this case would differ markedly between the two datasets while remaining consistent within each of them.
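
As a minimal sketch of how these statistics can be computed in R, the following fragment derives the per-item mean and coefficient of variation for one dataset; the data frame name and layout are assumptions for illustration:

    # Sketch: per-item mean and coefficient of variation (CV) for one dataset.
    # 'vive_items' is a hypothetical data frame with one row per subject and
    # one column per questionnaire item (scores already reverse-coded).
    item_stats <- function(items) {
      m <- colMeans(items)
      data.frame(mean = m,
                 cv = apply(items, 2, sd) / m)  # dispersion of users' opinions
    }
    # e.g., compare item_stats(vive_items)$cv against item_stats(myo_items)$cv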

The datasets collected for the Vive controller and for the Myo armband were also compared by means of a Kruskal–Wallis nonparametric test (Kruskal and Wallis 1952): p values lower than the 0.05 threshold (highlighted in bold in the tables of the following sections) suggest significant differences between the two interaction modalities.
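
In R, this comparison can be run with the base kruskal.test function, as in the following minimal sketch; the score vectors are hypothetical placeholders, not data from this study:

    # Sketch: Kruskal-Wallis comparison between the Vive and Myo datasets
    # for a single questionnaire item (hypothetical scores).
    vive_scores <- c(6, 5, 6, 4, 6, 5)
    myo_scores  <- c(4, 3, 5, 2, 4, 3)
    score  <- c(vive_scores, myo_scores)
    device <- factor(rep(c("Vive", "Myo"),
                         c(length(vive_scores), length(myo_scores))))
    kw <- kruskal.test(score ~ device)
    kw$p.value < 0.05  # TRUE flags a significant difference between modalities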

6 Overall scores of UMUX, SUS and PQ questionnaires

Table 1 reports the mean values and the coefficients of variation of the UMUX, SUS and PQ scores computed on the three datasets. For a correct comparison between scores on different scales (SUS scores vary between 0 and 4, while UMUX and PQ scores vary between 0 and 6), the mean values shown in the table were normalized by dividing SUS scores by 4 and UMUX and PQ scores by 6.

Table 1 Normalized mean values, coefficients of variation and Kruskal–Wallis p values for the scores of the questionnaires

The Kruskal–Wallis p values suggest a significant difference between the two interaction modalities on all three questionnaire scores. The UMUX score exhibits the largest difference between Vive's and Myo's mean values.

The charts in Fig. 2 compare the probability density functions between Vive and Myo datasets for UMUX, SUS and PQ scores.
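
Density plots such as those in Fig. 2 can be produced, for instance, with the ggplot2 package; the data frame layout below is an assumption for illustration:

    # Sketch: density plot comparing the Vive and Myo score distributions.
    # 'df' is a hypothetical data frame with a numeric 'score' column and a
    # 'device' factor ("Vive"/"Myo"), one row per subject.
    library(ggplot2)
    ggplot(df, aes(x = score, fill = device)) +
      geom_density(alpha = 0.4) +
      labs(x = "Questionnaire score", y = "Density")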

Fig. 2 Density plots for the UMUX, SUS and PQ scores

7 UMUX items

The scores of the odd items, which measure the user's satisfaction and the perceived ease of use, are computed by subtracting 1 from the scale value chosen by the user. The scores of the even items, which represent the perceived frustration and the system reliability, are computed by subtracting the scale value chosen by the user from 7. In this way, all the scores increase with users' positive impressions: high scores on items 1 and 3 denote high levels of satisfaction and ease of use, whereas high scores on items 2 and 4 denote low levels of frustration and the absence of system malfunctions.
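
A minimal R sketch of this reverse-coding, where 'responses' is a hypothetical 84 x 4 matrix of raw answers on the 1-7 scale:

    # Sketch: UMUX reverse-coding so that all item scores increase with
    # positive impressions.
    score_umux <- function(responses) {
      scores <- responses
      scores[, c(1, 3)] <- responses[, c(1, 3)] - 1  # satisfaction, ease of use
      scores[, c(2, 4)] <- 7 - responses[, c(2, 4)]  # frustration, reliability
      scores  # all item scores now range from 0 to 6
    }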

For each UMUX item, Table 2 reports the mean values and the coefficients of variation computed over the scores of the three datasets presented in the previous section. The numbers in brackets specify the ranking in decreasing order based on the mean scores and on the coefficients of variation.

To focus only on the significant differences among such values, the k-means method was employed to cluster the UMUX items according to mean scores and coefficients of variation. The appropriate number of clusters was estimated by means of the NbClust package (Charrad et al. 2014), available for the R software. Three clusters for the mean scores and three other clusters for the coefficients of variation were obtained: each cluster corresponds to a score level, represented in Table 2 by a number in brackets ranging from 1 to 3. The last column of the table contains the p values obtained through a Kruskal–Wallis test: p values lower than the 0.05 threshold, which reveal a significant difference between the two datasets, are highlighted in bold.
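
The clustering step can be sketched in R as follows, using NbClust to estimate the number of clusters and k-means to assign each item to a score level; the vector of item means is a hypothetical placeholder:

    # Sketch: grouping questionnaire items into score levels.
    library(NbClust)
    item_means <- c(5.1, 4.9, 3.8, 5.3, 2.7, 4.0, 4.8)  # hypothetical means
    nb <- NbClust(data = matrix(item_means), distance = "euclidean",
                  min.nc = 2, max.nc = 4, method = "kmeans", index = "all")
    k <- length(unique(nb$Best.partition))  # clusters chosen by majority rule
    level <- kmeans(item_means, centers = k)$cluster  # score level per item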

Table 2 Mean values, coefficients of variation and Kruskal–Wallis p values for the UMUX questionnaire items

The mean scores computed on all the 84 subjects fall into cluster 2, which represents an intermediate level: this dataset averages the scores collected for the Vive controller and those collected for the Myo armband. The mean scores obtained for the Vive controller experience fall into cluster 1, which represents the highest level, whereas those obtained for subjects using the Myo armband fall into lower-ranked clusters. In particular, the most important difference can be noticed on items 3 and 4, which fall into the lowest level for the Myo armband, represented by cluster 3: this suggests that both the perceived ease of use and the reliability are clearly worse for the Myo armband. This significant difference is confirmed by the Kruskal–Wallis nonparametric test, whose p values are lower than the 0.05 threshold. On the other hand, the test reveals no significant difference on items 1 and 2, which express the user's satisfaction and level of frustration: in this case, when moving from the Vive controller to the Myo armband, the average scores simply drop from the highest to the intermediate level.

The coefficients of variation, reported in the second row within each cell of Table 2, represent the level of disagreement among users' opinions. In the dataset of the scores of all the 84 subjects, the coefficients of variation of items 2 and 4 fall into the highest-level cluster: this reveals a significant disagreement on the level of frustration and on reliability. Users' opinions are highly discordant on these two factors especially in the Myo armband scenario, where there is also a high variability in the perceived ease of use. A high variability in the level of frustration can be noticed also in the Vive scenario, but in this context users seem to strongly agree on a good ease of use. The discordant opinions on ease of use and reliability in the Myo armband scenario can probably be explained by the varying performance of the device in terms of responsiveness, which tends to depend on the forearm morphology, unlike a handheld device such as the Vive controller, which exhibited almost the same performance for all the users. On the other hand, the high variability in the level of frustration in both the Vive controller and the Myo armband scenarios suggests that there are also other factors, not related to the perceived ease of use and not depending on the interaction modality, that contribute to generating different opinions in users.

The significant differences on items UMUX3 and UMUX4 seem to be confirmed by the charts in Fig. 3, which combine density and histogram plots to compare the distributions of item scores in the Vive and Myo datasets: high scores (5 and 6) are predominant in both the Vive and Myo datasets for items UMUX1 and UMUX2; on the other hand, the maximum score (6) is highly predominant in the Vive dataset for items UMUX3 and UMUX4, whereas in the Myo dataset a more moderate score (4) is predominant for item UMUX3 and no particular score is clearly predominant for item UMUX4.

Fig. 3 Density plots for the four UMUX items

8 SUS factors: usability and learnability

Table 3 reports the mean values and the coefficients of variation computed for the SUS usability and learnability components on the scores collected for all the 84 users, for the subjects using the Vive controller and for the subjects using the Myo armband. For both interaction modalities, usability has a higher mean score than learnability. Moreover, the higher coefficient of variation for learnability suggests that users have more discordant opinions about the ability to learn how to use the system quickly and independently. The last column of the table reports the p values of the Kruskal–Wallis test carried out to compare the Vive and the Myo datasets: only for usability does the p value reveal a significant difference between the two interaction modalities.

The k-means method applied to SUS factors detected two clusters for mean scores and two other clusters for coefficients of variation, represented in Table 3 by the numbers in brackets. In both interaction modalities, the usability mean scores fall in the higher level (represented by cluster 1), whereas the learnability mean scores are in the lower level (represented by cluster 2).

Coefficients of variation highlight a significant variability in learnability scores, especially in the Myo scenario, where some users probably supposed that responsiveness and accuracy issues in gesture detection were caused by a poor familiarity with the device.

Table 3 Mean values, coefficients of variation and Kruskal–Wallis p values for the SUS components

The charts in Fig. 4 compare the distributions of Vive and Myo datasets for each SUS component: the maximum score (4) is predominant in Vive usability, while there is no clear score dominance in Myo usability; on the other hand, in Vive learnability a moderate score (3) is predominant, followed by an intermediate score (2.5), while Myo learnability scores are mainly divided between moderate (3) and intermediate (2) values.

Fig. 4 Density plots for usability and learnability SUS components

8.1 Usability and learnability items

Table 4 reports the mean values and the coefficients of variation computed for each single item of the SUS questionnaire.

For items 4 and 10, which make up the learnability component, a significant difference can be noticed only on the former, which also has the lowest mean values among all the SUS items in both interaction modalities: users think there is not much to learn in order to use the system, but they feel they need the support of a technical person, especially to interact through the Myo armband; however, the very high coefficient of variation (≈ 0.689) suggests a great disagreement on the latter point.

Table 4 Mean values, coefficients of variation and Kruskal–Wallis test for SUS scores and their relative errors

The mean values in Table 4 were grouped into 4 clusters, as suggested by the NbClust package, through the k-means method. Table 5 reports the 4 clusters in decreasing order, corresponding to 4 score levels. The items for which the Kruskal–Wallis test revealed a significant difference between Vive and Myo are highlighted in bold.

Item 3, which represents the ease of use in the strictest sense, just as item 3 of the UMUX questionnaire analyzed in the previous section, is greatly influenced by the interaction modality: it achieves the highest score level (1) for the Vive controller, but drops by two levels, down to level 3, for the Myo armband. The corresponding p value, which is lower than the 0.05 threshold, confirms that there is a significant difference in terms of ease of use between the two interaction modalities. This difference is probably related to a better familiarity with handheld controllers and to some reliability issues of the Myo armband, whose accuracy in gesture detection varied with the forearm morphology of the users wearing it. However, the second aspect seems to take a back seat for users, since no significant difference between the two interaction modalities can be noticed in terms of system consistency, represented by item 6.

The mean values of items 8, 9 and 10, which deal with encumbrance, the confidence acquired with the system and the practice needed to become confident with it, are also influenced by the interaction modality, although to a lesser extent, since each of them drops by just one score level: however, only on item 9 does the Kruskal–Wallis p value suggest a significant difference between the two interaction modalities.

Table 5 Score levels for the SUS questionnaire items

For the coefficients of variation of the SUS items, the NbClust package suggested 3 clusters, which were produced through the k-means method and correspond to the 3 score levels reported in decreasing order in Table 6. The items in bold are those whose coefficient of variation falls into a different score level between the Vive and Myo datasets. They suggest that opinions on the consistency of the system (item 6) and on its frequent use (item 1) are more in agreement in the Myo scenario than in the Vive scenario. On the other hand, opinions on system complexity (item 2), ease of use (item 3) and need for support (item 4) are more in agreement in the Vive scenario than in the Myo scenario. In particular, the huge disagreement on the need for support is the main cause of the high variability in the learnability scores highlighted in the previous subsection.

Table 6 Levels of disagreement among users about the SUS questionnaire items

9 PQ: presence

The Presence Questionnaire (PQ) (Witmer and Singer 1998) focuses on the evaluation of the sense of presence in virtual environments. It consists of 24 items with seven response options, but only 19 items were considered in this study: the items about audio and haptic interaction, which are not supported by the current version of our simulator, were discarded.

The scores of items 14, 17 and 18 are computed by subtracting the scale value chosen by the user from 7 (since for such items higher values denote worse opinions expressed by the users). The scores of all the remaining items are computed by subtracting 1 from the scale value chosen by the user.

9.1 PQ factors

PQ items can be grouped into four factors (Witmer et al. 2005) representing involvement (items 1, 2, 3, 4, 5, 6, 7, 10 and 13), interface quality (items 17 and 18), adaptation/immersion (items 8, 9, 14, 15, 16 and 19) and visual fidelity (items 11 and 12).

As suggested by the NbClust package, three clusters for mean scores and three other clusters for coefficients of variation of PQ components were created through the k-means method. In this way, three different score levels were obtained: they are represented by the numbers in brackets in Table 7.

Table 7 Mean values, coefficients of variation and Kruskal–Wallis p values for PQ components

A Kruskal–Wallis nonparametric test was performed on these four PQ components: the p values lower than 0.05 for involvement and adaptation/immersion revealed a significant difference for these two factors between the two interaction modalities.

The charts in Fig. 5 compare the distributions of the Vive and Myo datasets for each PQ factor. In the Vive scenario, most involvement scores achieve high values in the range between 5 and 5.8, while in the Myo scenario most of the scores are distributed within a wider range between 4.2 and 5.5. In the Vive scenario, adaptation/immersion scores are mainly concentrated around the values of 5.5 and 4.8, while in the Myo scenario there are no particularly dominant values or ranges. These differences in the charts reflect the significant difference between Vive and Myo for the involvement and adaptation/immersion components. Apart from visual fidelity, which is obviously not affected by the interaction mode, interface quality also has very similar distributions between the two interaction modes, except for a clear predominance of values around 5 in the Myo scenario.

Fig. 5 Density plots for the four PQ components

9.2 PQ items

Tables 8 and 9 report the mean scores and the coefficients of variation computed for the 19 items of the PQ questionnaire. The p values of the Kruskal–Wallis test, reported in the last column of the tables, revealed a significant difference between Vive and Myo on items 1, 2, 3, 5, 14, 15, 16 and 19. In particular, items 1, 2, 3 and 5 belong to the involvement factor, while items 14, 15, 16 and 19 belong to the adaptation/immersion factor.

Table 8 Mean values, coefficients of variation and Kruskal–Wallis p values for the PQ questionnaire (items 1–10)
Table 9 Mean values, coefficients of variation and Kruskal–Wallis p values for the PQ questionnaire (items 11–19)

To focus only on significant differences among mean values, PQ items with similar mean scores and with similar coefficients of variation were grouped together through k-means clustering. The NbClust package was employed to estimate the relevant number of clusters. Two different clusterings, reported in Tables 10 and 11, were obtained: the former groups mean values into four score levels, whereas the latter groups coefficients of variation into three score levels.

Table 10 Score levels for PQ items

The items for which the Kruskal–Wallis test suggested a significant difference are highlighted in bold in Table 10.

PQ items 1, 2, 3, 5, 14 and 16 are the most influenced by the interaction modality, since their scores drop by two levels when the Myo armband is employed in place of the Vive controller: in particular, items 1, 2, 3, 5 and 14 drop from level 2 to level 4, while item 16 drops from level 1 to level 3. On the other hand, the scores of items 15 and 19 drop by only one level, the former from level 1 to level 2 and the latter from level 3 to level 4. Therefore, the interaction modality has only a moderate influence on adaptation in a strict sense (i.e., the ability to adjust to the virtual experience, represented by item 15) and on the ability to concentrate on tasks and activities (item 19). However, the adaptation/immersion factor in a wider sense also includes items 14 and 16, which are heavily influenced: the former represents the delay experienced between actions and outcomes, whereas the latter refers to the proficiency in movements and interactions. Items 8 and 9, which also belong to the adaptation/immersion factor, do not present significant differences according to the Kruskal–Wallis test: the former expresses the ability of users to anticipate the responses to their actions, whereas the latter represents the ability to survey and search the environment.

PQ items 4, 6, 10, 11, 12 and 13 remain at the highest score level for the subjects using the Myo armband, despite the more difficult interaction experienced with this device. In particular, it is important to underline that the sense of objects moving through space (item 6) and the ability to examine objects closely (item 11) and from multiple viewpoints (item 12) are not influenced by Myo's usability issues, even though such activities can also be carried out by giving commands through the Vive controller or by performing Myo's gestures to zoom or rotate the objects. A possible explanation is that users were more impressed by the possibility of walking toward the virtual objects to get closer and look inside them: this modality of examining the virtual objects probably prevailed over the interaction through the Vive controller or the Myo armband.

PQ item 13, representing involvement in a strict sense together with item 4, is not influenced by the interaction modality, but it is worth highlighting that in a wider sense the involvement factor also includes items 1, 2, 3 and 5. Therefore, the involvement factor is affected only in its two subcomponents dealing with control, made up of items 1 and 2, and natural interaction, made up of items 3, 5 and 7. A heavy influence can be noticed on the control subcomponent, since the scores of both its items drop by two levels. The influence is strong also on the natural interaction subcomponent, with the scores of items 3 and 5 dropping by two levels, even though item 7 is not significantly influenced by the interaction modality: naturalness in interaction and in movement control is considerably lower in the Myo armband scenario, but the consistency of the virtual experience seems not greatly affected by Myo's usability issues.

The interface quality factor includes item 18 about the interferences of the control devices: however, the Myo armband did not generate a higher interference than the Vive controller, even though the high coefficients of variation suggest that users have discordant opinions on both the Vive’s and the Myo’s interferences, as well as on the ability to concentrate on tasks, represented by item 19.

In general, the differences in terms of coefficients of variation between the two interaction modalities (Table 11) are not very pronounced, since all the items remain at the same level or drop by at most one level. The variability of the scores for PQ items 2 and 14, which concern the responsiveness of the environment and the delay between the user's actions and the expected outcomes, is slightly higher in the Myo scenario. This was probably caused by the unpredictable behavior of the Myo device, whose performance varies depending on the morphology of the user's forearm: in particular, during the tests the Myo device sometimes failed to detect some gestures (especially the spread-fingers gesture, shown on the bottom right in Fig. 1 and used to move the plane of the CT slices) when it was worn on thin forearms with a low muscle and tendon tone. Users experiencing such issues often tried to repeat the same gesture or thought there was a delay in gesture detection. For this reason, a higher coefficient of variation can be noticed on item 16, which suggests a higher variability in the interaction proficiency perceived by Myo's users.

Even though the Kruskal–Wallis p values do not highlight any significant difference between the two interaction modalities on the level of consistency of the virtual experience (item 7) and on the ability of users to anticipate the responses to their actions (item 8), a slightly higher variability can be noticed in the Myo scenario also for these two items.

Table 11 Levels of disagreement among users about PQ items (based on the coefficients of variation)

10 Discussion

The interaction modality influences several aspects of the user experience in a virtual environment. The involvement factor is the one with the highest number of items influenced by the interaction mode, even though the charts in Fig. 5 show a greater difference in the distribution of scores for the adaptation/immersion factor.

In the Myo armband scenario, users perceived a lower ability to control events (PQ item 1), a lower responsiveness of the system (PQ items 2 and 14), less natural and less proficient forms of interaction (PQ items 3, 5 and 16), a greater difficulty in adapting to the virtual experience (PQ item 15) and a greater distraction generated by the mechanisms used to perform activities (PQ item 19). Nevertheless, they did not perceive any noticeable degradation in task performance.

The SUS questionnaire can be decomposed into two factors (Lewis and Sauro 2009), namely usability and learnability, which are weakly correlated (Borsci et al. 2009). In particular, the learnability factor is made up of items 4 (“I think that I would need the support of a technical person to be able to use this system”) and 10 (“I needed to learn a lot of things before I could get going with this system”), whereas item 7 (“I would imagine that most people would learn to use this system very quickly”), contrary to what it may seem, belongs to the usability factor together with the remaining items, probably because it refers to the skills of other users (Lewis and Sauro 2009). A previous work (De Paolis and De Luca 2020) highlighted that Vive's controls and Myo's gestures have a similar level of intuitiveness, with no significant difference between the two devices in terms of learnability, despite Myo's inferior usability. The more detailed analysis presented in this paper on the two learnability items and on usability item 7 reveals interesting insights into the structure of the learnability component and into how users see their own performance in relation to that of others. Users generally think they are the ones who have the main problems with Myo: they believe that there is not much to learn, although they feel the need to be helped by an expert, and they think that it is easy for other users to learn how to use the system. This also suggests that the learnability component of SUS can be further decomposed into two subcomponents: one, closer to item 10, represents the amount of things to be learned, while the other, closer to item 4, concerns the practice needed under the guidance of an expert to master and apply them correctly. There are not many gestures to learn in order to interact with Myo, but they sometimes require practice before they can be performed accurately. During the experiments, it was actually the device that sometimes exhibited problems in terms of input veracity: in particular, the “spread your fingers” gesture, which allows translating the plane of the CT images, was sometimes interpreted by Myo as “wave left” or “wave right” because some users did not stretch their fingers enough when opening their hand. An earlier study (Dolopikos et al. 2021) had explained this ambiguity by pointing out that the finger abduction required to spread the fingers involves the flexor carpi radialis muscles, on which the wrist flexion necessary for the wave-in gesture is also based. In light of this, to avoid usability problems, touchless interaction should be based on a vocabulary of gestures that are well differentiated and clearly distinguishable: the device should be designed carefully to avoid ambiguities that could derive from peculiarities in the conformation of the user's limbs (e.g., arms that are too thin, in the case of devices based on electromyographic pulses).

However, the significantly lower score on PQ item 19 for the Myo device, which indicates a reduced level of concentration, could also refer to a decrease in focus of attention, caused by certain gestures requiring a higher degree of abstract reasoning. The gestures detected by the Myo armband, designed according to the principle of operation of the device (electromyographic pulse detection), might not be perceived as a natural form of interaction. Ideally, however, gesture design should be oriented toward the principle of guessability (Wobbrock et al. 2005), which would allow users to master their use without the need for prior knowledge or training. This could include the definition of gesture vocabularies based on user preferences and attitudes (Dong et al. 2016).

During the tests, some of the users experiencing problems with the Myo armband tended to repeat gestures, probably because they could not see the results of their actions (the so-called gulf of evaluation mentioned in Sect. 3.3) and thought these issues were related to a delay in gesture detection (as reflected in the lower average score for PQ item 14).

Nevertheless, the lower ease of use (UMUX item 3 and SUS item 3) and the lower reliability (UMUX item 4) of Myo did not generate frustration among users (UMUX item 2), who did not perceive any serious inconsistencies (SUS item 6) and remained generally satisfied with the system (UMUX item 1), probably because the fascination with virtual reality outweighed the difficulties related to the use of the device. In the experimental scenario considered in this work, the user was left completely free to explore the environment without time constraints, trying all the available interaction modes. It would be interesting to assess how the opinions expressed by the users would change in the presence of tasks to be performed within certain timeframes (as in the case of action video games) and how haste might amplify the perception of problems.

It is possible to speculate that leaving the user free to explore the environment autonomously, without any particular task to complete, allowed curiosity and attraction toward the VR environment to prevail over any difficulties. Such activities can fall within the context of informal learning (Lin et al. 2012): unlike institutional education, it relies on intrinsic motivations, such as personal curiosity, which make it an enjoyable activity.

Moreover, it is likely that the possibility of approaching the organs by walking and exploring them from the inside partially compensates for the possible shortcomings of the input device. In general, the possibility to walk in a virtual environment with the same movements as in the real world improves presence, ease of navigation, spatial orientation and movement understanding (Usoh et al. 1999; Chance et al. 1998). Therefore, in the considered scenario, the device used for interaction affects the ability to concentrate but not performance, as users still manage to complete the tasks they undertake. In a scenario with set goals, on the contrary, the unreliable behavior of the Myo armband could also interfere with the performance of tasks or the achievement of those goals, generating a greater sense of frustration that would also have a negative impact on user involvement.

11 Conclusions and future work

This paper focused on the usability and the perceived presence in a virtual environment: starting from the findings of a previous work (De Paolis and De Luca 2020), the differences between the HTC Vive handheld controller and the Myo armband touchless device were investigated. The analysis considered both traditional usability, covered by the items of UMUX and SUS questionnaires, and user experience in a virtual environment, assessed by means of the PQ questionnaire. The tests revealed a significant difference in terms of user experience between the two devices and a relevant impact on the sense of presence: we analyzed in detail which components are influenced by the mode of interaction and how much the users’ opinions agree on each of them.

In the considered scenario, the user was left free to interact as he/she liked, combining the interaction through the Vive controller or the Myo armband with walking in the virtual environment. In this context, the use of the Myo armband was seen to partially distract users from the tasks of exploring the virtual environment, but without generating frustration, probably because the fascination of the experience in the virtual environment tends to prevail over possible problems related to the functioning of the interaction device. Compared to the interaction based on a traditional controller, the Myo-based interaction suffers from problems in both the detection accuracy and the naturalness of its abstract gestures.

Another interesting finding is that users feel they do not have much to learn in order to use the Myo armband, but need the support of an expert, so they probably have more difficulty practicing with the device than understanding the gestures: this may suggest a slight prevalence of problems related to detection accuracy over problems related to the intuitiveness of abstract gestures.

For future experiments, we will evaluate the possibility of assigning precise tasks to the user, defining a workflow of actions to be performed through certain gestures. In this way, it will be possible to ensure that the user tries all the interaction possibilities at least once, preventing him/her from, for example, walking up to a virtual object whenever he/she is not able to perform the zoom gesture.

It will then also be possible to establish timeframes within which the user will have to perform the various actions, or to assign scores upon the achievement of certain objectives. In this way, it will be possible to evaluate how the impact of the different interaction modalities on the user experience varies both in the case of guided actions and in the presence of gaming policies.

The study could be extended to a wider range of more articulated gestures through the use of camera-based touchless devices, such as the Leap Motion controller (Leap Motion 2021), which can track hands and fingers, or the Microsoft Azure Kinect (Microsoft 2021a), which can detect movements of arms and legs. Furthermore, the considered experimental scenario could be ported to a mixed reality environment by adopting the Microsoft HoloLens (Microsoft 2021b), a headset able to detect hand gestures. These devices will make it possible to implement gestures for grasping virtual objects, which will be compared with the abstract semaphoric gestures addressed in this study in terms of usability and impact on the sense of presence. Moreover, by recording signals through an EEG headset, we could study the perception of affordances and compare it with the impressions gathered through the questionnaires.