In our user studies, we investigated in more detail the four approaches introduced in the preceding section: voice chat, sending emotion states, FoV indication, and video chat. In each of the four experiments, two methods were compared with each other using a within-subject methodology. The independent variable was the method; the dependent variables were presence, sickness, usability, and togetherness. In the second part, the components were compared with each other (between-subject) to find out which of them make the biggest contribution to the social viewing experience, and which play a minor role and may possibly be omitted.
Participants and material
In our studies, we focused on shared viewing by two persons in symmetrical environments (both using HMDs) to gain initial experiences. A total of 86 participants took part in the entire study (see Table 1). Among them were both VR beginners and experienced VR users. Some of the participants knew each other beforehand, others did not. We did not investigate the dependence of the results on these characteristics. For two experiments, the second person was not available and a member of the team took on the role of the co-watcher; their data was not included in the data set. All methods were implemented in such a way that they can be used in both remote and co-located environments. For all tests, an Oculus Rift with headphones was used. The films used were nature documentaries, similar in style and pace, each approximately 8 min long.
Except for the voice chat case, all tests were conducted in a large room, where a remote environment was simulated: both participants sat far from each other and were only connected by the network. They did not speak during the study. We chose this one-room setting to observe both participants in parallel. This was only possible for the visual methods, since the HMD blocks the view of the real environment. However, voices were not blocked, even when the participants used headphones. Therefore, for the voice chat test, the participants were in neighbouring rooms.
For each of the components, two methods were compared, as indicated in Table 1. The aim of our study was, on the one hand, to find out which method is more suitable for social viewing in CVR and, on the other hand, to learn more about the advantages and limitations of each method.
Voice Chat: Two voice chat methods were tested. The first one used spatial sound, while the second one used normal stereo sound, which did not depend on the viewer’s line of sight. The spatial sound came from the direction in which the speaking participants looked (“fromFocus” in Fig. 1).
Video Chat: The two video chat methods (Fig. 4) differed in the position of the chat window. In the front-method, the other person was in front of the view: the chat window was fixed at the bottom of the display and turned with the line of sight. This case is similar to the situation where two people are sitting opposite each other. However, the window is always in front, even when turning around. For the second method (side-method), the video chat window was fixed beside the viewer, in our study on the right side. In such a case, it depends on the viewing direction whether the person can be seen in the display. This case corresponds to the situation in a cinema, or sitting on the couch and watching a movie together. Even though in a real application a video chat would be combined with voice chat, the method was tested without spoken language, since we were interested in the influence of each component separately.
FoV Indication: To investigate FoV awareness, we chose two methods: the bar-method and the PiP-method. For the PiP-method, the FoV of the other person could be seen in a small video window that was fixed to the display and did not change its position during turning around (Fig. 3a). The bar-method is inspired by a glider warning system. At the bottom of the screen and on the right side, there are bars along a line (Fig. 3b). One of the bars is coloured red and indicates where the other person is looking.
Sending Emotion States: For sending information about feelings, two visual methods were compared. Smileys were sent in the first and photos of facial expressions were sent in the second method. For both methods, four pictures were available: two affectively positive and two negative ones. A hand-held controller was used for sending the pictures. The position of the picture was always the same, fixed on the edge of the display.
A within-subject test design was used to compare the two methods for each component. Each participant watched two films, each with a different method. The films and methods were counterbalanced in order and assignment. After each film, a part of the questionnaire about simulator sickness, presence, and togetherness was filled out.
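The counterbalancing of film order and film-method assignment can be sketched as follows. This is a minimal illustration; the film labels, method labels, and the rotation through conditions are our assumptions, not details taken from the study:

```python
from itertools import product

films = ["film_A", "film_B"]        # hypothetical labels for the two films
methods = ["method_1", "method_2"]  # the two methods of one component

# All four order/assignment combinations: which method is paired with
# which film, and which pairing comes first.
conditions = []
for first_film, first_method in product(films, methods):
    second_film = films[1] if first_film == films[0] else films[0]
    second_method = methods[1] if first_method == methods[0] else methods[0]
    conditions.append([(first_film, first_method),
                       (second_film, second_method)])

def assign(participant_id):
    """Rotate participants through the four counterbalanced conditions."""
    return conditions[participant_id % len(conditions)]
```

Each condition covers both films and both methods exactly once, so across every group of four participants, order and assignment are fully balanced.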
To measure simulator sickness, we applied the Simulator Sickness Questionnaire (SSQ) of Kennedy et al. (1993). For each item, one of the sickness levels (none, slight, moderate, severe) could be chosen, and the answers were transformed to a scale from 0 (none) to 3 (severe). To investigate presence, we used the IPQ presence questionnaire (Schubert et al. 2002). The questions were answered on a seven-level Likert scale. Since not all questions of the SSQ and IPQ are appropriate for CVR, we did not include all items. As a result, it was not possible to calculate the total score exactly as originally defined for each scale. However, we received enough information to compare the different test options. For the togetherness part, the following questions from the ABC questionnaire (IJsselsteijn et al. 2009) were chosen:
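Transforming the categorical SSQ answers to the 0-3 scale and aggregating over the administered subset could look like the sketch below. The item names and the plain-mean aggregation are our assumptions; since only a subset of SSQ items was used, the original weighted total score cannot be reproduced:

```python
# Mapping of the categorical SSQ answers to the numeric 0-3 scale.
SSQ_LEVELS = {"none": 0, "slight": 1, "moderate": 2, "severe": 3}

def score_ssq_subset(answers):
    """Mean score over the SSQ items actually administered.

    `answers` maps an item name to a categorical response. Because only
    a subset of SSQ items was used, a plain mean is reported instead of
    the original weighted total score."""
    values = [SSQ_LEVELS[a] for a in answers.values()]
    return sum(values) / len(values)

# Hypothetical responses of one participant after one film:
example = {"nausea": "slight", "eye_strain": "none", "dizziness": "moderate"}
```

Here `score_ssq_subset(example)` yields 1.0, i.e. an average symptom level between "slight" and "none".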
(S1) I feel part of a group because of the contacts.
(S2) I know what the other feels during contact.
(S3) Because of the contact, I can relate to the other person.
Again, we used a seven-level Likert scale for the answers. At the end of the study, the participants answered questions for comparing the methods:
(C1) Which method do you find more comfortable? (usability)
(C2) Would you use the method for a longer time? (usability)
(C3) With which method do you feel more connected with the other? (togetherness)
For the questions (C1)–(C3), the participants were asked to justify their answers.
The methods were implemented in Unity3D 2017. For the implementation of the synchronization features, the Multiplayer High Level API (HLAPI) was used ("Unity - Manual: The Multiplayer High Level API," 2018). With this API, no additional server is necessary; one of the participant computers can simultaneously act as client and server. For each co-watcher, a player prefab was defined, which includes the network identity and contains the necessary properties and functions. This prefab is invisible and represents the viewer in the VR environment. It was placed in the centre of the sphere onto which the omnidirectional movie is projected.
The Unity NetworkManager component manages the communication and the synchronization of the scenes. To avoid network problems, the movies were stored locally on the clients, and only positions, directions, and meta information were transferred via the network. In real-life applications with poor network quality, the video should likewise be transferred in advance to avoid interfering with the experience.
Voice Chat: The voice communication was realized with the Unity plug-in Dissonance Voice Chat; the spatial condition additionally used the Oculus Spatializer Plugin.
Video Chat: The webcam frame was implemented as an object that could be switched on/off by the user. Additionally, transparency could be customized. For the front-method, a webcam was placed in front of the viewer, for the side-method at the side. In order to convey the feeling of sitting side by side and to have the opportunity to look at each other, we positioned the camera for one user on the left side, for the other one on the right side.
FoV Indication: For both methods, the PiP as well as the bar, the Unity Raw Image was used. It shows non-interactive images to the user and can display any texture.
Sending Emotion States: The smileys and picture elements were realized by the Unity Raw Image component since a RawImage element is well suited for the representation of 2D graphics.
Results – Part 1: comparison of the methods for each of the components
For comparing the two methods of each component, we performed a two-sample t test (alpha = 5%) for each Likert item, which showed almost no significant differences regarding presence, sickness, and togetherness. The only difference was in the video chat component for question (S2), where the score for the side-method (mean = 4.41, SD = 1.87) was significantly higher than for the front-method (mean = 3.27, SD = 1.78, p = 0.04).
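A per-item comparison of this kind can be sketched with a plain two-sample t statistic; the sketch below uses Welch's form (unequal variances), and the Likert responses are invented for illustration, not the study's data:

```python
from statistics import mean, stdev

def welch_t(a, b):
    """Welch's two-sample t statistic (no equal-variance assumption)."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

# Hypothetical 7-point Likert responses for question (S2):
side  = [6, 5, 4, 6, 3, 5, 4, 6]   # side-method group
front = [3, 4, 2, 5, 3, 4, 3, 2]   # front-method group

t = welch_t(side, front)  # positive t: side-method scored higher
```

The p value would then be obtained from the t distribution with Welch-Satterthwaite degrees of freedom; that step is omitted here for brevity.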
We used Fisher's exact test to calculate the p values and to find significant differences in the results of the comparative questions (C1–C3). Additionally, we analysed the qualitative answers. In this way, we found preferences, advantages, and disadvantages, which we present in detail below.
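Fisher's exact test on a 2x2 table of answer counts can be computed directly from hypergeometric probabilities, as sketched below. The counts in the example are hypothetical, not taken from the study:

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact test for a 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are at most as likely as the observed one."""
    n = a + b + c + d
    row1, col1 = a + b, a + c

    def p_table(x):
        # Probability of the table with x in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs + 1e-12)

# Hypothetical counts for one comparative question:
# 18 of 23 participants chose one method in condition 1, 9 of 23 in condition 2.
p = fisher_exact_2x2(18, 5, 9, 14)
```

For a perfectly balanced table the two-sided p value is 1, and it shrinks as the preference counts diverge.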
Voice Chat: Both methods were accepted by the participants. Most of them would be willing to use voice chat even for longer videos (78.2% and 87%, respectively). When asked about the preferred method, the spatial-method received a higher score for all questions (Table 2).
Most participants preferred voice chat with spatial sound. P14 compared both methods in this way: “In the first method (stereo), the bulk of the conversation was about the gaze direction. With the second method (spatial), I could tell that just by the direction from which his voice came, and you could talk about the pictures right away instead of having to find them first.” P22 preferred the spatial-method: “It has felt more integrated into the video through the different locations of the sound. This made the experience more interesting.” Some participants mentioned the advantages and disadvantages of the spatial-method: “The spatial-method was helpful in finding the view. However, it was also a bit more distracting.” (P17). “The stereo-method distracts less; with the spatial-method, you also want to look where the other is looking and thus you always seek the right view”.
For a better understanding of social viewing, the following reasons given for preferring the spatial-method are relevant:
“helped to find the view from the other one” (P8, P11, P14, P17, P18, P19, P21)
“more spatial and more real” (P10, P21, P23)
“feeling of being in the same room” (P11, P22)
“closer to the real experience” (P14)
“more like exploring an environment together” (P11)
“more interaction” (P13)
Some of the participants preferred the stereo-method:
“more familiar” (P15, P16)
“less confusing” (P16)
“less distractive” (P16, P17)
“can hear well” (P5, P6)
Video Chat: In this study, the scores for usability and togetherness were very similar for both methods (Table 3).
Analysing the qualitative data, we recognized important findings for both methods. P18 remarked: "It depends on the content. For watching a cinematic movie, I find the side-method better."
The front view was superior in the following aspects:
“more comfortable” (P2, P6)
“better for communication” (P11, P17, P20, P21)
Reasons for preferring the side view were:
“similar to cinema/TV” (P1)
“can see more of the partner when he looks ahead” (P1)
“feel addressed when he turns to one” (P1)
“disturbs less, is more realistic” (P4, P7, P10, P11, P14, P19)
FoV Indication: In the FoV awareness part, we could not find any significant differences between the two methods (Table 4).
In the qualitative answers, we found advantages of both methods:
Benefits of PiP:
“easier to understand and faster” (P3, P4, P18)
“both views can be seen simultaneously” (P8, P12, P16, P18)
“I see what the other one sees” (P19)
“more personal” (P10), “more connected” (P15)
“does not have to turn the head” (P10)
Benefits of bar:
“concealed less” (P1)
“less intrusive/discreet” (P5, P6, P7, P11, P20, P21)
“easier to understand” (P5, P16, P20), “better orientation” (P14)
Sending Emotion States: For most participants, the smiley-method was more comfortable (75%). More participants would like to use the smiley-method for a longer time (85%). However, the feeling of togetherness differed just slightly (Table 5).
For some participants, the smiley-method seemed more familiar, but for others it was the face-method. P20 pointed out that images of faces could create expectations for communication. Several participants answered the second question differently from the first one, on the grounds that the face is more realistic (P3, P13, P16).
Benefits of smileys:
“faster and easier to recognize” (P2, P3, P5, P6, P8, P11, P16)
“familiar” (P2, P18)
“anonymous in the case of unknown people” (P7, P12)
“distract less” (P13, P18)
Benefits of face:
“more realistic” (P3, P13, P16)
Results – Part 2: comparison of the components
In the second part of the study, we investigated which components are important for the feeling of togetherness. For this, we compared the favourite methods of the components from part 1 with each other:
Voice chat: spatial-method
Video chat: front-method
FoV indication: PiP-method
Sending emotion states: smiley-method
Using the Shapiro–Wilk test, we checked the data for normality, which showed that a normal distribution could not be assumed for all components. Therefore, the Kruskal–Wallis test was used for finding significant differences between the components. There were no significant differences for presence and sickness. Using post hoc pairwise Mann–Whitney tests, we found significant differences for the togetherness aspect (questions S1–S3). In Fig. 5, the means and p values are summarized and illustrated. For significant differences (p < 0.05), the p values are printed in black; for weakly significant differences (0.05 ≤ p < 0.1), in grey.
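The Kruskal–Wallis H statistic underlying this comparison can be computed by pooling and ranking the per-component scores, as sketched below. The sketch omits the tie correction, and the score lists are invented for illustration:

```python
def ranks(values):
    """1-based ranks of `values`, with ties assigned their midrank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        # Extend j over the run of tied values.
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = midrank
        i = j + 1
    return r

def kruskal_h(*groups):
    """Kruskal-Wallis H statistic (without tie correction)."""
    pooled = [v for g in groups for v in g]
    r = ranks(pooled)
    n = len(pooled)
    h, start = 0.0, 0
    for g in groups:
        rank_sum = sum(r[start:start + len(g)])
        h += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * h - 3 * (n + 1)

# Hypothetical 7-point togetherness scores for three components:
voice  = [5, 6, 4, 6, 5]
smiley = [6, 7, 6, 7, 5]
fov    = [3, 4, 2, 4, 3]
h = kruskal_h(voice, smiley, fov)
```

The p value is then obtained by comparing H against a chi-squared distribution with (number of groups - 1) degrees of freedom; identical groups yield H = 0, and H grows as the rank distributions diverge.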
As Fig. 5 shows, the smiley-method was most important for togetherness, followed by the spatial voice chat and the video chat. Knowing the FoV of the co-watcher was less important for the participants of our study. We discuss this more thoroughly in the next section.