This study focuses on how people perceive the size of virtual content shown on immersive display technologies with true to scale data visualization. To this end, size judgment performance is compared between three scenarios: two showing to scale data representations on an LED floor and one showing relative-sized visualizations on a tablet computer.
Nevertheless, human perception using LHRD technology has not been an extensive focus of basic research. Bezerianos et al. presented a call for research on the perception of data on wall-sized displays, as there is still little research carried out in this domain: “We do not yet know how the perceptual affordances of a wall, such as the wide viewing angles they cover, affect how data is perceived and comprehended,” and they call “for more studies on the perception of data on wall-sized displays.” The type of display device directly influences spatial perception, visual space, and the control of spatial behavior, especially for display arrangements such as an LED floor. We follow this call for research and add an additional element: data visualization on large-scale augmented floor surfaces showing content in absolute scale.
In the purely physical domain, size and distance judgments have long been a focus of the literature. In 1963, Epstein presented the key findings that distance and size judgments are not systematically related and that deviations of size judgments vary with distance. Later, Epstein and Broota further evaluated the judgment of sizes and distances and the corresponding reaction times, finding a positive correlation between the viewing distance of objects and reaction time. In “The metric of visual space,” Wagner gives insights into judging distances, angles, and areas, as also conducted in this study. Cleveland and McGill present groundbreaking work on the visual decoding of information, namely graphical perception: they identify a set of elementary perceptual tasks and analyze how people extract quantitative information. More recently, Talbot et al. pick up these works and analyze the reasons for differences in the perception of charts.
For virtual environments, broad research has been carried out on perceived spaces in VR, such as distances, sizes, speeds, and spaces. Loomis et al. showed that egocentric distance judgments in physical environments nearly match 100% of the actual distance, whereas in virtual environments, distances are frequently underestimated. Renner et al. presented a literature review and summarized a “mean estimation of egocentric distances in virtual environments of about 74%.” They also grouped possible influence factors for this underestimation into four clusters: measurement methods, technical factors, compositional factors, and human factors. In contrast, current state-of-the-art head-mounted displays seem to ameliorate these effects: Kelly et al. showed that, when using modern HMD devices, the effect is reduced but not completely resolved. To our knowledge, however, no relative size judgment study has been carried out in VR that provides users with relative scales.
To the best of our knowledge, we are the first to execute size judgment experiments using a large-scale LED floor setup in comparison with a small-sized baseline measurement.
Study goal and predictions
One of the striking benefits of large-scale displays is the possibility of visualizing true to scale data, contents, or virtual scenes. In the context of the presented use case within the automotive industry, 3D contents with individual viewpoints have been intentionally excluded, and 2D representations (see Figs. 7 and 1) have been chosen for this study, since the aforementioned use cases are limited to the visualization of 2D data.
This evaluation gives insights into whether people can assess the sizes of 2D contents more accurately and precisely if they are shown true to scale rather than as relative-scaled representations. The baseline scenario uses relative-sized visualizations on a tablet computer, showing exactly the same visual cues as the true to scale scenarios. In this study, size judgment refers to edge length estimation. To date, there is no published research documenting the extent to which true to scale floor content supports people in estimating sizes using augmented floor surfaces. To address these issues thoroughly, this study employs verbal size judgments and objective measurements. Four different aspects are evaluated in this study:
Accuracy: Is there a systematic overestimation or underestimation (accuracy) of size judgments? (Mean absolute percentage error, see Armstrong and Collopy)
Precision: In which scenario do participants achieve the most precise size judgments? (SD of mean absolute percentage error)
Task completion time: Is there a difference in task completion time for the three different scenarios? (Objective time measurements)
Qualitative feedback: Do the users’ subjective judgments of precision and task completion time match the objective measurements? (Non-standardized questionnaire)
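To make the two error metrics concrete, the following sketch (helper names are ours, not the study’s software) computes the mean absolute percentage error and the sample SD of the absolute percentage errors over one participant’s trials:

```python
# Illustrative sketch (not the study's software): accuracy as the mean
# absolute percentage error (MAPE) over a participant's trials, and
# precision as the sample SD of those absolute percentage errors.
def abs_percentage_errors(true_cm, estimated_cm):
    """Absolute percentage error of each size judgment, in percent."""
    return [abs(est - true) / true * 100.0
            for true, est in zip(true_cm, estimated_cm)]

def mape(true_cm, estimated_cm):
    errors = abs_percentage_errors(true_cm, estimated_cm)
    return sum(errors) / len(errors)

def sd_of_ape(true_cm, estimated_cm):
    """Sample standard deviation of the absolute percentage errors."""
    errors = abs_percentage_errors(true_cm, estimated_cm)
    mean = sum(errors) / len(errors)
    return (sum((e - mean) ** 2 for e in errors) / (len(errors) - 1)) ** 0.5

# Two trials: a 100-cm square judged as 110 cm, a 200-cm square as 180 cm
print(mape([100, 200], [110, 180]))  # -> 10.0 (both errors are 10%)
```

Because MAPE normalizes each error by the true edge length, judgments of small and large squares contribute comparably, which is why it is preferred over raw centimeter deviations here.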
For this study, 22 voluntary participants were recruited, including production engineers, research engineers, PhD candidates, and students from different production planning departments in the manufacturing industry. Fifteen males and seven females took part, aged 21 to 57 years (M = 31.57, SD = 11.52). All participants reported normal or corrected-to-normal vision and chose the metric system as their preferred unit.
Setup, stimuli, and design
Three different modes of perception are evaluated. For all three scenarios, the same visualization software, visual cues, and interaction (besides user’s movement) are used, only the output modality is changed (see Fig. 8):
Tablet scenario (T): Relative-sized visualizations as a baseline
Floor scenario (F): True to scale visualization restricting the user’s viewpoint to the side of the LED floor
Floor and Interaction scenario (FI): True to scale visualization allowing the user’s movement on the whole LED floor
The rendering and evaluation software is a custom application that displays virtual squares in randomized order (six different sequences for the 3 scenarios), handles the randomized scenario workflow, and logs the evaluation results (square size, square rotation, pixels per meter, scenario completion time). In all three scenarios, the participants are shown 2D white squares on a black background. These squares have randomized sizes from 50 to 200 cm with random positions and orientations (±15°) on the screen (see Fig. 7). Additionally, a virtual ruler represents the absolute length of 1 m and remains at the same position (center bottom) throughout all scenarios. Besides the aforementioned 9-m × 6-m LED floor apparatus with a 10.81-m screen diagonal for scenarios (F) and (FI), scenario (T) is visualized on a 12.3″ tablet screen set to the same aspect ratio as the LED floor. The LED floor pixel pitch is 5 mm.
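A minimal sketch of how such randomized stimuli could be generated (the actual application is not public; names and the position-clamping rule are our assumptions):

```python
import random

# Hypothetical stimulus generator following the parameters described above:
# edge lengths of 50-200 cm, orientation jitter of +/-15 degrees, and a
# random center position clamped so the unrotated square stays on the
# 9-m x 6-m floor.
FLOOR_W_CM, FLOOR_H_CM = 900, 600

def random_square(rng=random):
    size = rng.uniform(50, 200)        # edge length [cm]
    rotation = rng.uniform(-15, 15)    # orientation [deg]
    half = size / 2
    x = rng.uniform(half, FLOOR_W_CM - half)   # center position [cm]
    y = rng.uniform(half, FLOOR_H_CM - half)
    return {"size_cm": size, "rotation_deg": rotation, "center_cm": (x, y)}

trials = [random_square() for _ in range(20)]  # 20 squares per scenario
```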
After signing the informed consent, each participant is given verbal instructions on the goal and evaluation procedure. Each participant executes all three scenarios (T), (F), and (FI) (within-subject design) in randomized order to counterbalance learning effects. There is no interaction with the virtual contents, so that the focus remains on the differences in spatial perception. In each scenario, 20 randomized (size, rotation, position) squares are visualized. After each square is presented, the participants verbally report their size estimate in centimeters to the experimenter, who records each response in parallel.
The three different scenarios are depicted in Fig. 8 and described as follows:
Tablet (T): The software visualizes the squares on the tablet computer as relatively sized content. The users have to judge the absolute edge length in relation to the visualized ruler.
Floor (F): The software visualizes the squares on the LED floor true to scale. The participant directly faces the LED floor from a static location, standing on the outside border, centered on the long edge of the LED floor (3 m to the center), and may not step onto it.
Floor&Interaction (FI): Same setup as in scenario (F), but in contrast, the participant has the opportunity to move freely on the augmented floor during the study and may stand directly above the respective square.
The experiment has been conducted a total of 22 times with different participants. Each evaluation takes approximately 20 min including the subsequent completion of the questionnaire. A total of 1320 datasets have been collected (22 participants, 3 scenarios, 20 trials) each one containing the actual and reported length [cm], spatial deviation/error [cm], task completion time [ms], pseudonym, scenario, square rotation, and position.
Finally, after executing all three scenarios, participants are handed a questionnaire to gather their subjective feedback. They are asked about their personal scenario preferences in direct comparison. In addition, each participant selects the preferred method and specifies the reason for this decision.
The results are clustered into three sections: accuracy, precision, and task completion time. Spatial deviation is the difference between the actual edge length (ground truth) of the squares and each participant’s estimation of the respective square edge length. Negative values represent an underestimation of size and vice versa.
Figure 9 shows a scatter plot of all three scenarios depicting the true length [cm] over the difference between true and estimated length. All three scenarios show that, on average, there is only little overall overestimation or underestimation in the users’ size judgments, with (T) having a mean of 0.951 cm (SD = 30.204), (F) −0.634 cm (SD = 22.499), and (FI) −5.694 cm (SD = 17.850). However, given the relatively large standard deviations compared with the small means, the interpretability of this spatial deviation is disputable, as overestimations and underestimations cancel out. Furthermore, the spatial deviation tends to rise with growing edge length of the squares, especially for (T) and (F). To normalize these effects, in the following, the mean absolute percentage error (MAPE) and the mean standard deviation (SD) of MAPE for trials within subject are used to evaluate accuracy and precision across the three scenarios.
MAPE is a measure of prediction accuracy. (T) shows a mean absolute percentage error of 14.783% (SD = 5.612%), (F) 11.369% (SD = 4.599%), and (FI) 9.814% (SD = 3.957%). Figure 10 depicts the box plots of the MAPE of all three scenarios. A statistical comparison is conducted for (T), (F), and (FI). Levene’s test shows that variance homogeneity is given for these data (F(2,63) = 0.942, p = 0.395); therefore, the standard one-way ANOVA can be used in the following. One-way ANOVA reports a statistically significant difference between the three scenarios (F(2,63) = 6.242, p = 0.003). The post hoc pairwise t test with Holm correction reveals no significant difference between (FI) and (F) (p = 0.284), but significant differences between (T) and (F) (p = 0.041) and between (T) and (FI) (p = 0.003).
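The reported analysis chain can be reproduced in outline. The sketch below runs Levene’s test, a one-way ANOVA, and Holm-corrected pairwise t tests with SciPy on hypothetical per-participant MAPE values (not the study’s data; function and variable names are ours):

```python
from itertools import combinations
from scipy import stats

# Sketch of the analysis pipeline: Levene's test for variance homogeneity,
# one-way ANOVA as omnibus test, then pairwise t tests with a manual
# Holm step-down correction.
def compare_scenarios(groups):
    samples = list(groups.values())
    _, levene_p = stats.levene(*samples)      # homogeneity of variances
    _, anova_p = stats.f_oneway(*samples)     # omnibus one-way ANOVA
    pairs = list(combinations(groups, 2))
    raw = [stats.ttest_ind(groups[a], groups[b]).pvalue for a, b in pairs]
    # Holm: sort p values ascending, multiply each by the number of
    # remaining hypotheses, enforce monotonicity, cap at 1.
    adjusted = [0.0] * len(raw)
    running = 0.0
    for rank, i in enumerate(sorted(range(len(raw)), key=raw.__getitem__)):
        running = max(running, (len(raw) - rank) * raw[i])
        adjusted[i] = min(1.0, running)
    return levene_p, anova_p, dict(zip(pairs, adjusted))
```

Only when Levene’s test is non-significant is the standard ANOVA appropriate, matching the procedure described above; with heterogeneous variances, a Welch-type ANOVA would be the usual fallback.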
Overall, therefore, the MAPE of both true to scale visualization scenarios (F) and (FI) can be regarded as significantly different from the relative-scaled (T) scenario. As both mean MAPE values are lower, scenarios (F) and (FI) have a higher accuracy than (T).
The mean SD of MAPE for trials within subject represents the precision of size judgments, i.e., the “variance of absolute percentage errors.” (T) shows a mean SD of 10.006% (SD = 3.394%), (F) of 9.759% (SD = 6.051%), and (FI) of 8.921% (SD = 7.898%). Figure 11 depicts the corresponding box plots of all three scenarios. Levene’s test is used to check equality of variances; with F(2,63) = 0.329, p = 0.721, variance homogeneity is given for the SD. Therefore, the standard one-way ANOVA with post hoc pairwise t test with Holm correction can be used, which reports F(2,63) = 0.184, p = 0.832. Since the one-way ANOVA shows no significance, post hoc test results are not reported here.
No significant difference in precision can be found using true to scale visualization scenarios (F) and (FI) compared with (T). However, considering the descriptive statistics of mean SD of MAPE for trials within subject, a minor tendency of lower precision of (T) compared with (F) and (FI) is depicted (see Fig. 11).
Task completion time
The participants received no instructions on task execution time or on the priority between precision and speed. Nevertheless, task completion time has been tracked throughout the experiment. Time measurements have been gathered for every single size estimation in all scenarios, starting when a square is displayed and ending when the size judgment is verbally passed to the study manager.
Participants show a training curve throughout the 20 runs of each scenario. Over runs 2 to 20, the median of scenario (T) is 5.063 s, whereas scenarios (FI) (9.959 s) and (F) (8.429 s) are slower. For all three scenarios, the very first run shows higher median values (see Fig. 12 and Table 2) caused by the lack of training.
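Excluding the untrained first run before taking the median, as done above, can be sketched in a few lines (the helper name is ours):

```python
import statistics

# Median task completion time per scenario, dropping the untrained first run
# so that the initial familiarization does not distort the statistic.
def median_after_training(times_s):
    """Median over runs 2..n; run 1 is discarded."""
    return statistics.median(times_s[1:])

# Example: a slow first run does not affect the result
print(median_after_training([12.0, 5.0, 6.0, 4.0]))  # -> 5.0
```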
After performing the experiment, all 22 participants filled out a questionnaire on their subjective perception. The non-standardized questionnaire compares the objective metrics with the participants’ subjective perception.
Task completion time
“For this method, I was able to judge the sizes more quickly.” The participants had to decide on each possible pairwise combination of the three scenarios: “(T) or (FI),” “(T) or (F),” “(F) or (FI).” Overall, the subjectively fastest scenario is (T). Comparing scenarios (F) and (FI), the results are equal (50% vs. 50%). Comparing both floor scenarios with (T), a subjective time benefit of (T) is reported: 72.73% favored (T) over (FI) and 63% favored (T) over (F). The subjective questionnaire feedback matches the objectively measured times: 86.67% of the participants who favored the (T) scenario in terms of task completion time were actually quicker. In contrast, only 7.14% of those favoring the (F) or (FI) scenarios were actually quicker.
“Using this scenario, I’m able to assess the sizes more precisely.” As for task completion time, all pairwise combinations of scenarios are tested: (FI) is estimated to be the most precise scenario (46.97%), followed by (T) (31.82%) and (F) (21.21%). Interestingly, people clearly preferred (FI) over (F) (86.36%), whereas when comparing (FI) with (T) and (F) with (T), there is no clear preference (50.00% and 54.55% in favor of the floor scenarios, respectively). Comparing these subjective results with the objective error metrics reveals a false impression of the subjects’ own precision in the (T) scenario: only 28.57% objectively performed more precisely using (T), even though they estimated it as the most precise scenario. In contrast, 78.26% of people in favor of either the (F) or (FI) scenario also objectively performed better using these scenarios. Additionally, participants reported their absolute subjective size judgment error. In general, participants objectively performed better, with a lower absolute median error than they subjectively expected (positive values only) (see Fig. 13). For the (T) scenario, the perceived median absolute error is 20.00 cm, whereas the objective median error is 14.08 cm. The same holds for (F) (perceived 20.00 cm, objective 11.08 cm) and (FI) (perceived 15.00 cm, objective 9.25 cm).
“I personally prefer the following scenario”: The highest ranked scenario is (FI) with 59.09%, followed by (T) (31.82%) and (F) (9.09%). Although (T) is ranked second as a preferred scenario, participants who preferred it never performed best (0/7) in terms of precision, and most of them even performed worst (5/7). Additionally, the questionnaire offered free-text answers: participants reported that when using (FI), they felt “more confident estimating sizes” (3×), “used natural walking” (1×) to estimate the absolute lengths, and changed their “viewing perspective” (2×) so that the squares were “right in front of them” (1×). They report gaining a better “spatial sense” (1×) and a higher degree of realism (2×), and find such a true to scale visualization helpful. People who preferred the (T) scenario mentioned a better “overview” (3×) and easier “comparison with ruler” (2×) due to the smaller display size and “higher resolution” (1×).
The results of this study indicate that both absolute- (true to scale) and relative-scale visualizations have advantages:
For absolute-scale visualizations, there is a significant difference in size judgment accuracy between the tablet scenario and both floor scenarios (F) and (FI): using the LED floor with true to scale visualization has a positive influence on the accuracy of size perception. These experimental results are in accordance with earlier findings by the authors (see Otto et al.), where cascadable, room-scale projection systems are used to realize industrial applications as well. In addition to LED-based and projection-based systems, more and more industrial application scenarios are realized using VR/AR interaction techniques. However, head-mounted displays (HMDs) lack two main benefits compared with augmented floor surfaces: First, HMDs are single-user devices, whereas augmented floor surfaces can be utilized by groups of up to 30 people. Second, perceived spaces and sizes are frequently underestimated using VR HMDs (following Kelly et al.), even though this effect gets smaller with state-of-the-art headsets. In contrast, LED floors do not show effects of overestimation or underestimation, following the results of this publication. Therefore, using true to scale visualization enables the participants to judge sizes more accurately.
For relative-scale visualizations, task completion times tend to be lower. Overall, using scenarios (F) and (FI) is slower than using (T). Even though higher task completion times could be a hindrance for other use cases, in automotive production validation, task completion time is less important than high accuracy.
Another interesting effect in human size judgments is rounding habits: all participants reported size judgments in rounded form. Typical granularities of size estimates are steps of 5 cm (5/22), 10 cm (16/22), and 25 cm (1/22). None of the participants gave sub-centimeter precision results. Therefore, rounding effects are still smaller than the perceived size judgment capability (compare Fig. 13).