After presenting the descriptive results of the user study, we report the outcomes of a regression analysis that estimates the relationship between head yaw and the questionnaire results, which reflect the participants’ affective states. We then use the identified measures to derive a prediction model and assess whether the relationship is suitable for performing predictions.
In the user study, we collected eight self-assessed measures (see Table 1). The results for presence were on average 4.75/7 (SD = 1.07), which we consider reasonable. This is confirmed by other recent VR studies yielding presence values in the same range (see Chang et al. 2019; Zenner et al. 2020). The average mental demand was 3.14/10 (SD = 2.01), which is rather on the lower end. The elevated SD for mental demand could be explained by each user perceiving the VE differently, due to a combination of individual traits (Jensen and Konradsen 2017). This is further supported by subjective feedback from the user study: some participants reported that they had to concentrate considerably while being exposed to the VE, while others reported that they did not have to concentrate at all. The average physical demand was 0.86/10 (SD = 1.00), which is unsurprisingly low. As the participants proceeded through the VE in a seated setup and interacted with it only by pushing the trackpad on the HTC Vive’s controller, they were limited to mainly head movements. This could also be one of the reasons for the participants’ low average effort of 2.60/10 (SD = 1.93). The average temporal demand was 1.88/10 (SD = 1.77), which is also rather low. This could be explained by the VE’s design, as the participants were able to navigate through the training at their own pace without any time constraints. The average perceived performance was 8.17/10 (SD = 2.02), which is on the upper end, implying that participants felt successful in completing the virtual training session. This can be explained by the VE’s task, which solely required pushing the HTC Vive controller’s trackpad to proceed through the application. Consequently, there was little possibility for wrong actions that might decrease a user’s perceived performance. Finally, the participants’ average frustration was 2.36/10 (SD = 2.50), which was also rather low. However, the SD for frustration is considerable. This could be explained by the impossibility of returning to a previous step during the training, as some participants reported being frustrated about this limitation. The VE’s average SUS score was 82.68/100 (SD = 11.02), which can generally be classified as good usability (Bangor et al. 2009).
We conducted a regression analysis to investigate the relationship between head yaw and the users’ affective states, measured through the questionnaires. As independent variables, we use the following head movement features computed from head yaw, i.e., the head’s rotation around the vertical axis (z-axis): mean (Me) and standard deviation (Cv) of angular displacement (Won et al. 2016), as well as mean angular speed (Vh), which was also used by Pfeuffer et al. (2019). We then created a linear regression model for each of the dependent variables: presence, the six NASA TLX items (mental demand, physical demand, temporal demand, perceived performance, effort, and frustration), and the SUS. These eight linear regression models are shown in Table 2.
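To make the modeling step concrete, the following is a minimal sketch of how such per-questionnaire models could be fitted. It is not the authors’ original code; the file name and the column labels (Me, Cv, Vh, and the measure names) are hypothetical. We use statsmodels here because it directly reports adjusted \(R^2\) and per-predictor p-values.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per participant, with the head-yaw features
# (Me, Cv, Vh) and the eight questionnaire scores as columns.
df = pd.read_csv("head_yaw_features.csv")

measures = ["presence", "mental_demand", "physical_demand", "temporal_demand",
            "performance", "effort", "frustration", "sus"]

# One OLS model per questionnaire measure, with the three yaw features
# as predictors (mirroring the structure of Table 2).
for measure in measures:
    model = smf.ols(f"{measure} ~ Me + Cv + Vh", data=df).fit()
    print(measure, "adj. R^2 =", round(model.rsquared_adj, 3))
    print(model.pvalues.round(3))  # per-predictor significance
```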
Results from the regression analysis indicate that our metrics, the independent variables, are significant predictors in five models, i.e., mental demand, physical demand, perceived performance, presence, and usability. The exceptions are the models for temporal demand, effort, and frustration. Possible reasons why these three models did not show significant relationships could be that the relationship is weak and would only become significant with larger sample sizes, or that there is simply no relationship, e.g., due to the setup of the task. In our study, the self-paced nature of the task might reduce the relevance or severity of temporal demand (also indicated by its low average). We found that for four out of five of the dependent variables, at least two independent variables showed a significant influence. The adjusted \(R^2\) of the regression models shows that head movement features can explain up to 25% of the variance in users’ answers. In general, users’ head movement features appear to be most predictive of their perceived performance in the task done in the VTE. The models also showed indications of relationships between the metrics and usability, mental demand, and physical demand, to lesser degrees. Presence is shown to have only a weak relationship with the selected metrics. While this might be due to the metric selection, another factor could be that the perception of presence varies more among users than their perception of the other measures. For example, users might agree more on the meaning of a 3 out of 10 in terms of physical demand than on what a 3 out of 7 means in terms of presence. Further discussions of the five models are presented in the following paragraphs.
Mental demand Users who let their gaze wander to cover a wide area of the VE, e.g., to find relevant information, and who moved their head slowly indicated high mental demand. Covering a larger area, i.e., a high Cv value, relates to having to gather and digest more information, which leads to more information to process and, in turn, to an increase in mental demand. A higher angular speed (Vh) indicates that users were likely more certain about where they wanted to direct their attention. Such users are then more likely to report a lower mental demand, as they were more certain throughout the task than users who moved more slowly.
Physical demand The coefficients of the predictors for physical demand show a similar pattern to those for mental demand, that is, a positive coefficient for coverage (Cv) and a negative coefficient for angular speed (Vh). In this case, speed is less negatively related, which is intuitive, since a larger speed might in fact have caused an increased physical demand. Moreover, the range of the area covered by head rotation, as indicated by a high Cv value, seems less important in predicting physical demand than mental demand. The model for physical demand, however, shows a lower adjusted \(R^2\) value than the one for mental demand. This could be because physical demand may be better reflected in other types of movements, e.g., torso or limb movements, whereas increased head movements might rather reflect the users’ mental activity.
Perceived performance Faster head movements (large Vh) may reflect users’ certainty in what they are doing, which could result in increased confidence that they did well on the given task. A negative impact of a large coverage (Cv) is the need to search for information extensively, which indicates uncertainty. While needing more information is not necessarily negative, it might still indicate feeling more challenged, i.e., being less optimistic about one’s success. Finally, the users’ average yaw direction (Me) is also shown to be a significant predictor for perceived performance. The negative coefficient indicates that users who turned more to the right reported lower perceived performance. This could be a consequence of our VE setup, where the instruction board and the table with components are located to the right of the 2D plan of the final assembly and the final assembly itself. Users who looked more to the right relative to their peers could be the ones spending less time on the assembly task than on reading instructions or examining components on the table, and hence reported lower perceived performance.
Presence Our regression model indicated that Me is a significant predictor for users’ sense of “being there,” i.e., the presence score. The negative sign of the coefficient indicates that users who turned more to the right tended to report a lower presence score. The coefficient of \(-0.111\) shows that each degree of yaw towards the right lowers the estimated presence score by about a tenth of a point. Thus, users who on average turned their heads ten degrees to the right of center are estimated to report one point lower in presence compared to those who turned their head left and right equally, i.e., whose average yaw would be at the center. As shown in Table 1, presence scores range from 1 to 7 with a mean of 4.75 and a standard deviation of 1.07. With the training setup within the VE having points of interest on the users’ left-hand side and none on their right-hand side, turning towards the right might indicate users’ decreased attention and lack of involvement during the training.
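As a worked example of this interpretation, using the reported coefficient:

\[
\Delta\,\text{presence} \approx \beta_{Me} \cdot \Delta Me = -0.111~\text{points/degree} \times 10^{\circ} \approx -1.1~\text{points.}
\]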
According to a review by Yaremych and Persky (2019), findings from studies that use tracing data of physical behavior, such as ours, can be influenced by the specifics of the VE used for the study. Thus, while general relationships between behavioral data and affective states might exist, they must be carefully assessed in multiple settings to allow for generalizations. In the following paragraphs, we set forth a few possible explanations and relationships that might hold beyond our study. These explanations could serve as starting points for future studies on tracing physical behavioral data, in particular that of head yaw.
Usability A large positive coefficient on Vh possibly indicates that users who moved faster were more at ease with using the VE. Slower movements, on the other hand, could be due to users having difficulties processing the information presented through the VE. This is in line with the negative coefficient on coverage: users who felt less need to look around and gather information within the VE would deem the system to have high usability.
We evaluate the suitability of the features extracted from head yaw data for coarsely predicting users’ questionnaire responses as follows. First, we selected the questionnaires, i.e., the dependent variables, of interest based on our regression in Sect. 5.2. That is, we took the five dependent variables that are shown to have significant predictors: presence, mental demand, physical demand, perceived performance, and usability. We then divided the response values into two groups based on the median in order to achieve groups of balanced sizes. The distributions of the resulting groups, namely less than the median value and greater than or equal to the median value, are presented in Fig. 3.
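A minimal sketch of the median split, assuming the responses are stored in an array; note that with ties at the median, the two groups are only approximately balanced:

```python
import numpy as np

def median_split(responses):
    """Binarize responses: 0 = below the median, 1 = at or above the median."""
    responses = np.asarray(responses, dtype=float)
    return (responses >= np.median(responses)).astype(int)

# Example: hypothetical SUS scores split into two groups -> [0, 1, 1, 0, 0]
groups = median_split([72.5, 85.0, 90.0, 80.0, 77.5])
```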
To evaluate the prediction models, we employed the leave-one-out cross-validation approach. That is, we leave one sample out, train the model with the remaining \(n-1\) samples, and test the model on the left-out sample. This procedure is repeated for each sample, i.e., n times. We report the cross-validation accuracy, i.e., the average accuracy over all n samples.
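The procedure can be sketched as follows (not the original code; we assume X holds the head-yaw features and y the binarized responses from the median split):

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

def loo_accuracy(model, X, y):
    """Leave-one-out cross-validation accuracy: train on n-1 samples,
    test on the held-out sample, and average over all n repetitions."""
    hits = []
    for train_idx, test_idx in LeaveOneOut().split(X):
        model.fit(X[train_idx], y[train_idx])
        hits.append(model.predict(X[test_idx])[0] == y[test_idx][0])
    return float(np.mean(hits))
```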
Our evaluation included two methods for the prediction task, namely linear regression and a decision tree. For the decision tree, we used the scikit-learn implementation (Pedregosa et al. 2011). We set the maximum depth to 3 to keep the model simple and left the other hyperparameters at their defaults. A commonly applied baseline for inferring whether a user belongs to one group or the other is to guess based on the proportion of group sizes, i.e., every user is predicted to belong to the majority group. We used this baseline to evaluate the performance of both prediction models.
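A sketch of the comparison, reusing the loo_accuracy helper from above. The text does not specify how linear regression was turned into a group prediction; thresholding its continuous output at 0.5 is our assumption. The majority-group baseline corresponds to scikit-learn’s DummyClassifier with the most_frequent strategy.

```python
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

class ThresholdedLinearRegression:
    """Linear regression used as a binary classifier by thresholding
    the continuous prediction at 0.5 (our assumption)."""
    def __init__(self):
        self._reg = LinearRegression()
    def fit(self, X, y):
        self._reg.fit(X, y)
        return self
    def predict(self, X):
        return (self._reg.predict(X) >= 0.5).astype(int)

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "linear regression": ThresholdedLinearRegression(),
    "decision tree": DecisionTreeClassifier(max_depth=3),
}
for name, model in models.items():
    print(name, loo_accuracy(model, X, y))  # X, y as assumed above
```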
Outcomes Table 3 shows the cross-validation accuracy for each model on each measure. Overall, the decision tree performed best. It outperformed the baseline for 4 out of 5 measures, with a difference larger than 10% for physical demand and usability. For perceived performance, simple linear regression did better than the baseline and clearly outperformed the decision tree. This outcome is aligned with our regression analysis, where the model for perceived performance had the highest adjusted \(R^2\) value. Presence and mental demand did not yield satisfactory prediction accuracy compared to the baseline. It is also no surprise that no single model performs best at predicting all five measures, given the “no free lunch” theorem (Wolpert and Macready 1997), which states that there is no single best model for all tasks.
While most users complete questionnaires genuinely, some might not answer properly (Mertens et al. 2017). This manifests in the intentional or unintentional provision of incorrect answers, which decreases data quality and could even lead to wrong conclusions. The intentional provision of wrong answers can be related to users’ low motivation to invest time and effort; e.g., in a crowdsourcing setting, participants might be monetarily driven, and data screening might be needed to eliminate poor responses (Chmielewski and Kucker 2019). However, even motivated users might unintentionally provide wrong answers. For example, users might exhibit an optimism bias that manifests in underestimating mental effort and overestimating their performance. That is, users might require a lot of mental effort to address a task in a VE, but report low scores. Furthermore, users might interpret scales differently. Some users answer such that their answers across many questions are Gaussian distributed, i.e., only few answers consist of the minimum or maximum value. Other users might tend towards more binary answering, where answers lean towards extreme values. Some users might also exhibit a central tendency bias.
To assess the ability to detect manipulated answers, we assume that (most) users answered correctly. That is, if we alter users’ answers and there is a relationship between behavior and users’ self-reported affective states, we should be able to detect the alteration. Accordingly, we assess whether we can detect when the actually reported answer of a user has been altered. The smallest unit of possible change depends on the range of the VR measure. We defined it as follows:
Presence scores (range 1–7): 1 unit corresponds to 1 point.
Mental Demand, Physical Demand, and Perceived Performance scores (range 0–10): 1 unit corresponds to 1 point.
Usability scores (range 0–100): 1 unit corresponds to 15 points.
The difficulty of this task depends heavily on the magnitude of the manipulation, e.g., changing an answer from a score of 10 to 0 is much easier to notice than from 10 to 9. Therefore, we investigate three levels of manipulation: (1) minor: answers are increased or reduced by the smallest amount, i.e., just one unit; (2) medium: changed by two units; and (3) major: changed by three units.
It is also ensured that the manipulated answers are neither above the maximum nor below the minimum possible value, e.g., the smallest possible answer is always increased.
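A sketch of how a manipulated answer could be generated under these constraints; the unit sizes follow the definitions above, while the random choice of direction is our assumption:

```python
import random

# Smallest unit of change per measure, following the definitions above.
UNITS = {"presence": 1, "mental_demand": 1, "physical_demand": 1,
         "performance": 1, "usability": 15}

def manipulate(answer, unit, level, lo, hi):
    """Shift an answer by `level` units (1=minor, 2=medium, 3=major),
    flipping the direction if the shift would leave the valid range [lo, hi]
    (e.g., the smallest possible answer is always increased)."""
    delta = level * unit
    direction = random.choice([-1, 1])
    if not lo <= answer + direction * delta <= hi:
        direction = -direction
    return answer + direction * delta

# Example: a minor manipulation of a presence score of 1 (range 1-7) -> 2
manipulated = manipulate(1, UNITS["presence"], level=1, lo=1, hi=7)
```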
To assess whether a user’s answer is manipulated, we take the prediction of a model trained without that user’s data and compute the difference between the prediction and the user’s answer. We classify an answer as incorrect if the prediction error is larger than a threshold. For each user, we generate one manipulated answer and also use the actually reported answer; that is, there is an equal number of manipulated and “truthful” answers, so a guessing approach would obtain 50% accuracy. Two types of models are evaluated for this detection task, namely linear regression and a regression tree. For the regression tree, we kept the maximum depth at three to maintain the model’s simplicity. Figure 4 shows the cross-validation accuracy of both models for each manipulation level. In general, the larger the manipulation, the more clearly the models outperform guessing, i.e., 50%.
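The detection rule can be sketched as follows, shown here for the regression tree; the threshold tau is a free parameter, as the text does not state how it was chosen:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def flags_as_manipulated(X_train, y_train, x_user, reported_answer, tau):
    """Train a regressor without the user's data, predict their answer from
    the head-yaw features, and flag the answer as manipulated if the
    absolute prediction error exceeds the threshold tau."""
    reg = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)
    predicted = reg.predict(np.asarray(x_user).reshape(1, -1))[0]
    return abs(predicted - reported_answer) > tau
```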
Results Both models behave qualitatively similarly. Minor manipulations could be detected clearly above the baseline only for presence and usability. However, the larger the manipulation, the higher the accuracy that can be achieved. For major manipulations, the accuracy ranged from about 60% up to 85%, depending on the measure and model. This indicates the suitability of the approach, in particular since the data used to infer the model might be “noisy,” i.e., contain user replies that are incorrect but assumed to be correct. It is also no surprise that perceived performance and mental demand are the most difficult to predict. Table 1 lists the standard deviations of the responses; these two measures exhibit the largest standard deviations relative to their possible min–max range. That is, their replies are the most “noisy,” indicating that it is fairly plausible for a user’s reply to fluctuate within the order of the manipulation. In turn, this makes it difficult to distinguish manipulation from natural variation.