Introduction

Awareness refers to actors taking heed of the context of their joint effort [40]. Although awareness may appear to be primarily a matter of observing and displaying certain modalities of action, information sharing is crucial for developing it, as it allows teams to manage the process of collaborative work and to coordinate group or team activities [15]. Awareness information therefore plays a mediating role in collaboration and in creating shared understanding [24]. Several different types of awareness can be distinguished [40]: general awareness [21], collaboration awareness [32], peripheral awareness [8, 22], background awareness [11], passive awareness [15], reciprocal awareness [17], mutual awareness [7] and workspace awareness [25].

Workspace awareness is defined “as the up-to-the-moment understanding of another person’s interaction with the shared workspace” [25]. To maintain workspace awareness, people need to gather information from the environment, understand it and predict what it means for the future. Shared visual spaces provide situational awareness [16] and facilitate conversational grounding [18, 19]. In collaborative environments, visual information about team members and objects of shared interest can support successful collaboration and enables greater situational awareness [23]. Situational awareness is thus crucial for fluid, natural and successful collaboration, allowing distributed actors to adjust, align and integrate their personal activities with the activities of others [25]. Designers of a collaborative system need to take many different aspects into account in order to support awareness, although supporting awareness is often not a primary goal in developing a system of this type [25]: generally, the major goal is not just to provide and maintain awareness, but to complete certain tasks in the environment.

In many domains, a quick and adequate exchange of visual context-related information to establish a common ground is necessary in order to make proper decisions and to avoid costly mistakes that cannot easily be undone. Augmented reality (AR) systems allow users to see the real world with virtual objects superimposed upon, or composited with, the real world [5, 6], where virtual objects are computer graphic objects that exist in essence or effect, but not formally or actually [36]. AR systems are not limited to the use of HMDs; they mainly have to combine real and virtual objects as previously described, be interactive in real time and register virtual objects in 3D [5]. AR systems can be used to establish a common ground during cross-organizational collaboration in dynamic tasks [37]. AR systems have also been used to increase social presence in video-based communication [14] or to help in complex assembly tasks [30]. They can further be used to establish the experience of being virtually colocated. Virtual colocation entails that people are virtually present at any place in the world and interact with others who are physically present at another location by using AR techniques. Examples of such virtual colocation can be found in the fields of crime scene investigation [39], in-flight maintenance [13] and information exchange in the security domain [34]. Such new approaches create new collaborative experiences and allow distributed users to collaborate on spatial tasks, create a shared understanding and establish a common ground.

In previous work, we have developed AR systems for virtual colocation [34, 39]. A local investigator wears a head-mounted display (HMD) with an integrated camera. By streaming the video captured from the camera, a remote colleague can see what the local investigator is seeing, and both can interact in AR. Usability studies of our AR system with employees from different security organizations and reenacted scenarios show that such AR systems are suitable for information exchange, as well as for distributed situational awareness and collaboration of teams in the security domain. The usability studies, however, also revealed some issues. Local users had a limited workspace awareness and experienced discomfort when virtual content added by the remote expert unexpectedly popped up in their view without any prior notification. This problem appeared only on the local user’s side. Remote users reported no problems with understanding the local users’ activities within the shared virtual space. In order to address the limited workspace awareness of the local users, we implemented an automatic audio/visual notification mechanism to inform the local user of the remote user’s activities.

In this article,Footnote 1 we evaluate and compare three conditions (i.e., no, audio and visual notifications) to improve workspace awareness for the local user while performing collaborative tasks in AR. We also investigate how much this additional information influences the local user’s personal focus in performing the tasks. The paper is organized as follows: the “Related Work” section presents related work on AR systems that support collaboration between a remote expert and a local worker; it also covers literature on workspace awareness in different types of collaborative systems. The “User Study” section presents the user study, with details about the task design, the questionnaires and the system architecture of our AR environment. The “Results” section discusses the results and relevant findings. The paper ends with critical observations in the “Critical Observations” section and conclusions and future work in the “Conclusion and Future Work” section.

Related Work

Remote Collaboration

There are many recent examples from different domains in which AR systems successfully enable collaboration between a local user and a remote user.

Kurata et al. [31] built the Wearable Active Camera/Laser (WACL) system that allows a remote expert to observe and point at targets in the real workplace around the field-worker, by controlling the WACL with a mouse click on the video image received from the local operator through a wireless network. The system enables the remote collaborators not only to independently set their viewpoints into the local user’s workplace but also to point to real objects directly with a laser spot.

In a project on improving future crime scene investigation, technological foundations for users interacting with and within a virtual colocation environment were developed [39]. Here, remote expert investigators can interact with one local investigator who wears a video see-through HMD. By using simple spatial tools, the remote user can place restricted area ribbons, highlight objects of interest, or analyze a bullet trajectory. The virtual objects placed cannot be edited or moved around.

Gauglitz et al. [20] implemented a markerless tracking system that requires no prior knowledge of the scene being modeled, enabling a remote user to provide spatial annotations that are displayed to a local user on a handheld device.

Adcock et al. [2] built a prototype that allows remote collaboration between two people in which only the remote user is able to add virtual content. The environment of the local worker is captured using depth cameras and transmitted to the remote expert. The remote user makes annotations via a touch screen, creating lines, markers and symbols to assist the local user. The annotations are then projected into the local worker’s environment with a data projector.

Huang et al. [30] developed a system for remote guidance that allows the expert to be immersed in the virtual 3D space of the worker’s workspace. The system fuses the hands of the expert in real time, captured in 3D, with the shared view of the worker, creating a shared virtual interaction space.

Following the same line of thought but using different hardware, Sodhi et al. [41] developed BeThere, a proof-of-concept system designed to explore 3D gestures and spatial input, which allow remote users to perform a variety of virtual interactions in a local user’s physical environment using a self-contained mobile smartphone with attached depth sensors.

Lukosch et al. [34] developed an AR system to support visual communication between a local and a remote user in the security field. Users can interact in the shared work space, adding virtual content using both a 2D graphical user interface (for the remote expert) and a 3D user interface with hand gestural input (for the local user who wears an optical see-through HMD).

These examples illustrate the use of AR to support collaboration between a local worker and a remote expert in various domains. Though user studies have shown that virtual colocation is possible and effective, the same studies have shown that local users feel remotely controlled by the experts, which diminishes their abilities. Additionally, the experts feel that they miss something when they are not physically present at the scene [9, 39]. It remains an open research question how users can become and stay aware of each other’s activities in an AR-based collaboration environment for virtual colocation. In the following section, we present different solutions proposed for improving workspace awareness in systems that allow collaboration between members of a team.

Workspace Awareness

The concept of workspace awareness has been researched in the field of computer-supported cooperative work to address a variety of coordination problems [14, 44]. Gutwin and Greenberg [25] distinguish three different information categories that contribute to workspace awareness:

  1. Who: This category provides information on the presence of others, their identity and authorship.

  2. What: This category provides information on users’ activities, their intentions and the affected artifacts.

  3. Where: This category provides information on the location of activities, the gaze direction of users, the view of users and the reach of users.

Workspace awareness is maintained by delivering different types of notifications about other people’s activities to the users during synchronous or asynchronous collaboration; the interruptions caused by these notifications can, however, become a problem for the users’ personal focus. The effects of interruptions on people’s activities have been thoroughly studied in the literature: it has been repeatedly noted that an interruption has a disruptive effect on both a user’s task performance and emotional state [35].

Ardissono et al. [4] described a context-dependent organization of awareness information and presented an analysis of interruption and notification management in the collaboration environments of heterogeneous Web applications. Notifications were delivered in graphical form as pop-up windows in the lower-right corner of the screen.

Dourish and Bellotti [15] analyzed four collaborative writing systems to explore three different approaches to the critical problem of group activity awareness: the informational approach, the role-restrictive approach and the shared feedback approach. The results show the usefulness of information on progress and joint activities in increasing the collaborators’ awareness. Furthermore, the results outline the tension between group awareness and personal focus.

In most collaborative systems, awareness is maintained by visual cues, which may overload the visual channel for conveying information. Audio notifications have therefore been studied in human–computer interaction (HCI) as an alternative means of providing awareness, for instance using symbolic sounds (named “earcons”) played to indicate particular events [10]. These are useful for systems that require few earcons: users can quickly associate an earcon with what it represents, but must still remember a mapping. For a large number of earcons, however, this mapping becomes difficult to learn and remember.

Gutwin et al. [27] developed a granular synthesis engine that produces realistic chalk sounds for off-screen activity in a groupware system. Their experiments demonstrated the effectiveness of synthesized dynamic audio that accurately reflects the type and quality of a digital action in providing information about the other user’s activity.

Hancock et al. [28] conducted two experiments on the use of different non-speech audio notifications with an interactive multitouch, multiuser tabletop display. First, they used affirmative sounds to confirm a user’s actions and negative sounds to indicate errors. Second, they tested two conditions: localized sound, where each user has their own speaker, and coded sound, where users share one speaker but the sounds are personalized for each user. The results show an improvement in group awareness but also reiterate the tension between group awareness and individual focus discovered by Dourish and Bellotti [15].

Compared to the related work presented above, in our current study we use automatic audio/visual notifications that are generated whenever the remote user interacts with the system. Thus, we provide information on activities performed by the remote user (the “what” information category, as identified by [25]). The user study focuses on comparing three conditions (No notifications, Audio notifications and Visual notifications) under two different workload levels of the task. We hypothesize that adding the automatic notifications affects the user’s ability to focus on the task, but we consider this as part of the trade-off between improving the workspace awareness and the personal focus in a collaborative system.

User Study

In order to explore the effect of audio and visual notifications on the collaboration process, a user study was conducted. Bearing in mind the tension that exists between group awareness and the personal focus of the individual, our goal was to find out which communication channel is more suitable for receiving notifications in our system and what information should be sent to the local user [26].

Two variables were used in our study: the type of notifications the system generates and the workload level of the task (two levels of difficulty). We included both variables in the design of the experiment as well as in the evaluation instrument of our AR framework.

Task Design

We aimed at developing a clean research environment to study the use of AR to support workspace awareness in a spatial task. Games in general are useful as research tools, as one scenario can be repeated several times under the same conditions, thus offering comparable test situations. Games as research instruments do not focus on players’ knowledge creation or adaptation, but instead allow researchers to investigate elements, such as actors and processes, in a controlled environment [43].

Learning transfer in these games does not take place between the game and the player, but between the game and an outside observer [38]. The observer can intervene in the gaming process according to his or her research aims. In our case, using a game allowed us to change the level of task complexity during game play. In general, when games are used for research, the validity, or degree of correspondence between the reference system and the simulated model thereof, is crucial [38].

Following the conceptualization of [38], a researcher starts with one or several questions about the reference system, with the handicap of being unable to collect data about it. As a valid simplification of the reference system, a (simulation) game is developed or chosen and played, and data about it are gathered. After data collection, analysis and interpretation, the researcher has to translate his or her findings back to the reference system in order to make a difference to it. The game should therefore show a high level of realism, or fidelity, to make sure that the reference system is represented with all necessary roles, rules, actions and decisions included. In our case, we found a game that provided a realistic representation of a spatial task with different levels of task complexity, allowing us to study virtual colocation.

We designed a scenario that fits the general situation in which a remote expert provides information to a local worker to accomplish a certain task. We created a use case that allowed for controlled experiments, in which the local user receives assistance in solving a 2D assembly puzzle named KataminoFootnote 2 (see Fig. 1). The Katamino puzzle was chosen for the following reasons:

  • The concept of a puzzle supports the design of scenarios in which a remote “expert” provides instructions to a local “worker”.

  • Katamino offers different levels of difficulty. Depending on the position of the slider on the board game, there are 9 levels of increasing difficulty (between 3 and 11). For example, setting level 10 (as in Fig. 1) implies that the local player has to use 10 pieces to fully cover the rectangular area on the board game defined by the right part of the slider. With such different levels of difficulty, it is possible to evaluate whether the notifications affect the focus of the local “worker” under different task loads.

    Fig. 1 Katamino assembly puzzle

  • For each level of difficulty, there are several different solutions, depending on the pieces chosen at the beginning of the game. The local users were asked to solve the puzzle three times for each level of difficulty, each time using a different set of pieces. We consider that solving the puzzle with one set of pieces has no immediate effect on the player’s ability to solve it with a different set of pieces. In this way, we tried to minimize the bias related to the learning effect of repeating a task.

By superimposing different virtual objects over the pieces of the puzzle, the remote user provides instructions that lead to certain solutions chosen beforehand. Using a 2D GUI, the remote user can communicate different actions to the local user: to remove a piece that is not part of the current solution, to place a piece, or to rotate it on the board game. In order for the remote player to identify certain squares on the board game using text messages, we added the letters A–E on the slider (e.g., A3 means the intersection of row A with column 3).
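As a minimal illustration of this coordinate scheme, the text messages can be read as letter–number pairs; the helper below is a hypothetical sketch, not part of our implementation.

```python
def parse_square(code):
    """Parse a board coordinate such as 'A3' into zero-based (row, column) indices.
    Rows are labelled with letters, columns with numbers (illustrative sketch only)."""
    row = ord(code[0].upper()) - ord("A")   # 'A' -> 0, 'B' -> 1, ...
    col = int(code[1:]) - 1                 # '3' -> 2
    return row, col

# parse_square("A3") == (0, 2): the intersection of row A with column 3
```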

In our experimental design, we did not include audio communication between the users, as this might have influenced the workspace awareness of the local user. Instead, we chose to focus only on providing automatically generated awareness information, preventing interference from other information sources that could have affected our observations on workspace awareness.

Participants

Twelve participants played the game as local users in the experiment, each rewarded with a 10 EUR gift card, provided as motivation to attend the experiment. In order to ensure the same conditions for providing instructions to each participant playing the role of local user, the same person acted as the remote user for all 12 local users. There were 4 females and 8 males, aged between 18 and 44 years (\(M = 26.7\); SD \(=\) 6.8), all with academic connections (6 bachelor students, 1 master student, 4 Ph.D. candidates and 1 lecturer).

Measures

Although providing awareness to users is an essential feature of a collaborative system, its evaluation is not straightforward, and little research has been done on assessing the quality of the awareness provided by a system [3].

In [25], the authors identified specific elements that characterize the workspace awareness in a system. Some of them relate to the present (e.g., presence, identity, action, intention, location and gaze) and others relate to the past (e.g., action history, event history, presence history and location history). Antunes et al. [3] considered that there are three important issues associated with workspace awareness: tasks, characterized by who, what, when and how they are accomplished; interaction, which defines how the group interacts with the workspace and what information is necessary to sustain it; and finally the level of task interdependence perceived by the group.

Starting from the work of [25] and [3], we created a list of questions that are relevant for our AR system and applicable to the current hardware configuration. For instance, we do not have a sensor for eye-tracking and thus cannot provide information on the gaze of the remote user to the local user. Considering the different levels of difficulty in the Katamino puzzle and the potentially disruptive effect of the provided notifications, we added the NASA-TLX questionnaire [29] to assess the task load of the local users. Since awareness is not the only goal to be achieved in designing a collaborative system, we were also interested in finding out whether the local users were able to focus on the task they were supposed to do and whether they were overloaded with too much information. The resulting set of questions is shown in Tables 1 and 2. All responses in Questionnaire 1 were given on a five-point Likert scale from 1 (strongly disagree) to 5 (strongly agree).

Table 1 Questionnaire 1—questions after each condition
Table 2 Questionnaire 2—questions after all conditions for each level of difficulty

Equipment

We have developed a framework named DECLARE (DistributEd CoLlaborative Augmented Reality Environment) based on a centralized architecture for data communication to support virtual colocation of users.

DECLARE is a multimodal, multiuser, highly scalable parallel framework that integrates a shared memory across its running components. Data communication is handled by a modular shared memory mechanism (part of DECLARE) that decouples data transfer in time and space. Decoupling in time means that if either the local user or the remote user disconnects temporarily, the video and AR data related to the current work session are automatically transferred as an initial update at that user’s next work session. In practice, the remote expert and the local user resume their activities according to their roles within the scenario without losing track of the activities that occurred while they were offline.
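A minimal sketch of this decoupling-in-time behavior is given below; the class and method names are hypothetical and only illustrate the idea of replaying the latest session state to a reconnecting user.

```python
class SessionStore:
    """Illustrative sketch of 'decoupling in time': the latest state of a work
    session is kept in shared memory, so a user who reconnects receives it as
    an initial update before live data resumes."""

    def __init__(self):
        self._state = {}                     # session id -> latest AR/video state

    def update(self, session_id, data):
        self._state[session_id] = data       # always keep the most recent state

    def on_reconnect(self, session_id, push):
        if session_id in self._state:
            push(self._state[session_id])    # initial update at the next work session
```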

DECLARE can be adapted to different types of display devices, depending on the specific limitations of each device. For instance, a video see-through HMD has the advantage of a larger field of view, the possibility to display solid virtual objects and good alignment of the virtual content superimposed over the real world. However, capturing the scene through a camera and then streaming the video images to the headset display is processing-intensive and often leads to higher latency, which can make users dizzy. With an optical see-through HMD, on the other hand, aligning the virtual content with the real world is a difficult problem. For our present study, we used the optical see-through META SpaceGlasses Footnote 3 (see Fig. 2), which have an integrated depth sensor and RGB camera. The technical specifications are: display resolution \(960\times 540\) pixels (qHD), aspect ratio 16:9, FOV 23\(^\circ\); RGB camera \(1280\times 720\) (MJPEG), 30 fps.

Fig. 2 The optical see-through META SpaceGlasses

All components of DECLARE communicate through a shared memory space via wireless or wired connections (see Fig. 3), using a data and event notification approach (see Fig. 4). The video frames captured by the RGB camera of the headset worn by the local user are streamed in real time to the remote expert. For placing the virtual objects, we use robust dynamic simultaneous localization and mapping (RDSLAM), a state-of-the-art markerless tracking model, with the implementation provided by its authors [42]. This module receives the video frames from the local user’s HMD camera. Based on the input video frames, RDSLAM computes for each frame the parameters of the camera position and orientation, together with a sparse cloud of 3D tracked points. These tracked points are used to attach the virtual content in AR. In order to make RDSLAM ready for tracking, a prior calibration has to be done by the local user by moving the HMD horizontally (see Fig. 5, right). This is done once at the beginning of the play session, and again during the play session if the tracking becomes unstable.

Fig. 3 Diagram of the main components of DECLARE

Fig. 4 Diagram of data and event notifications for user actions across DECLARE-based modules and subsystems

In addition to tracking the HMD position and orientation, RDSLAM performs a mapping of the physical environment of the local user and generates an internal representation of the world. The markerless tracking yields the HMD camera location and orientation, while the mapping yields a representation of the physical world in the form of a sparse cloud of 3D points.

The sparse cloud of 3D points represents visual key points which connect the augmented world to the physical world and act as virtual anchors supporting annotation with AR markers. Once attached to a key point (see Fig. 6), a 3D-aligned object is correctly rendered in the subsequent video frames at run time, consistent with the HMD camera motion (see Fig. 5, left).

Fig. 5 Diagram of events for the local user tracking at runtime (left) and for tracking calibration (right)

Fig. 6 Diagram of data notifications for the selection event

In DECLARE, the network communication is implemented using both the TCP/IP and UDP standards. TCP/IP is used for data transfers between the server and software modules running on hardware devices linked via network cables. UDP is used for data transfers between software modules connected via wireless links. In the case of UDP-based network communication, each frame of a video sequence is encoded as a compressed image into a UDP packet. At VGA resolution (\(640\times 480\) pixels per frame) and a JPEG compression quality of approximately 85 %, the compressed frames proved to fit below the size limit of UDP datagrams.
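A minimal sketch of this frame-per-datagram scheme is shown below, assuming OpenCV for JPEG encoding; the address, port and function name are placeholders and do not reproduce the actual DECLARE implementation.

```python
import socket
import cv2

MAX_UDP_PAYLOAD = 65507     # practical upper bound for a single UDP datagram
JPEG_QUALITY = 85           # compression quality reported above

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_frame(frame_bgr, addr=("192.168.0.10", 5005)):
    """Compress one VGA frame to JPEG and ship it as a single UDP datagram."""
    ok, buf = cv2.imencode(".jpg", frame_bgr,
                           [cv2.IMWRITE_JPEG_QUALITY, JPEG_QUALITY])
    if not ok or len(buf) > MAX_UDP_PAYLOAD:
        return False        # drop frames that would not fit into one datagram
    sock.sendto(buf.tobytes(), addr)
    return True
```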

Updates regarding newly available data are automatically sent to each software module or client subsystem (such as the software application of the local user or the software application of the remote expert) using a notification and push system of events and data (see Figs. 4–6).

The automatic processing of events and data notifications is only done for the software modules and subsystems that register for the specific type of data. In this way, the subsystem of the local user (see Figs. 4–6) does not receive video data from its own camera via the network. This ensures an optimal use of the network bandwidth, especially in wireless data communication.
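The registration principle can be sketched as a small publish/subscribe hub; the class and method names below are hypothetical and only illustrate why a producer never receives its own data back.

```python
from collections import defaultdict

class SharedMemoryHub:
    """Illustrative sketch of type-based registration: a subsystem only receives
    the data categories it registered for, and never the data it produced itself."""

    def __init__(self):
        self._subscribers = defaultdict(list)    # data type -> list of (owner, callback)

    def register(self, owner, data_type, callback):
        self._subscribers[data_type].append((owner, callback))

    def publish(self, sender, data_type, payload):
        for owner, callback in self._subscribers[data_type]:
            if owner != sender:                   # never echo data back to its producer
                callback(payload)

# hub.register("remote_expert", "video_frame", remote_view.show)
# hub.publish("local_user", "video_frame", jpeg_bytes)  # the local client is skipped
```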

Consistency of actions is a critical aspect of DECLARE. Updates to the graphical user interface that result from direct user actions, both in the local user’s system and in the remote expert’s system, are applied only when the data and events related to the action are received back as feedback from the server. This ensures that the updates are available on all subsystems and modules. The flow of data and event notifications is illustrated in the diagram in Fig. 4. For the video stream, a synchronization mechanism in the shared memory ensures that the same video frame is played for the local user, the remote user and the RDSLAM module at the same time.
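The “apply only on server feedback” rule can be summarized in a few lines; the sketch below is an assumption-level illustration (the callback names are invented), not the DECLARE code.

```python
class ConsistentClient:
    """Sketch of server-confirmed updates: a user action is sent to the server
    first and reflected in the GUI only when the server broadcasts it back, so
    every subsystem applies the same updates in the same order."""

    def __init__(self, send_to_server, apply_to_gui):
        self.send_to_server = send_to_server
        self.apply_to_gui = apply_to_gui

    def user_action(self, action):
        self.send_to_server(action)   # no optimistic local update here

    def on_server_event(self, action):
        self.apply_to_gui(action)     # update the GUI only once the server confirms
```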

Building on the DECLARE framework, we further implemented specialized functionality for both the local user and the remote user, and created different user interfaces using the Unity3D game engine.

Local User Support

The video captured by the RGB camera mounted on top of the HMD worn by the local user is sent to the other components in the system. In order to align the virtual content (see Fig. 7) with the real objects in the view of the local user, we had to clip an area of the image captured by the RGB camera of the headset. This area corresponds to what the local user sees through the display of the HMD.
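This clipping step can be expressed as a simple fixed crop of each incoming frame; the offsets below are illustrative placeholders (the actual fixed area was chosen empirically, as discussed under “Critical Observations”).

```python
def crop_to_hmd_view(frame, x=170, y=60, width=300, height=360):
    """Clip the region of the 640x480 RGB camera image that corresponds to the
    area visible through the optical see-through display. The offsets and size
    here are placeholders, not the values used in the experiment."""
    return frame[y:y + height, x:x + width]
```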

The Remote User

The remote user can view on a laptop the video captured by the local user’s HMD camera and can interact with the system using the keyboard and a standard mouse through a classic 2D GUI. On the left part of the screen (see Fig. 8), a menu with buttons that allow the remote user to trigger different actions is shown. Besides text messages, the remote user can attach the virtual objects shown in Fig. 7 (from left to right: the “approval” sign indicates a piece that has to be used by the local player, the “reject” sign indicates a piece that has to be removed from a configuration, and signs that indicate that a piece should be rotated by 90\(^\circ\) clockwise or counterclockwise).

Fig. 7 Icons used for providing instructions to the local user

Regarding the positioning of the virtual content, there are two types of virtual objects: fixed objects that stay in the same position relative to the camera (on the screen), and virtual objects that are linked to the points tracked by the RDSLAM algorithm. The remote user has the option of “freezing” the video stream by pressing the “F” key. While the stream is frozen, he/she is still connected to the local user and sees the live camera stream in the upper-right part of the view (see Fig. 8, right). The full live-stream image reappears when the remote user presses the “U” key. The transparent rectangle in the middle of the image (see Fig. 8) represents the area of the HMD used to display additional information. When objects are within this area, the remote user knows they are visible to the local user.

Fig. 8 The view of the remote user

The Automatic Notifications

In order to support workspace awareness in our AR framework, we implemented different cues for the local user that are generated every time the remote user interacts with the system. Each notification is automatically sent to the local user to inform them about the action taken by the remote user. These cues are presented to the local user as either audio or visual notifications (a dispatch sketch follows Fig. 9):

  • Audio (speech) notifications: every action taken by the remote user is automatically indicated with an audio message. The audio notifications are short spoken messages generated by an online text-to-speech synthesizer,Footnote 4 to distinguish between the different actions of the remote user. Each message describes the action taken (e.g., “Remote added an object,” “Remote selected an object,” “Remote deleted an object,” “Remote freezes the image,” “Remote plays video stream again”) and is played only once to the local user.

  • Visual notifications: every action taken by the remote user is indicated in the lower-right corner of the local user’s view by a small icon that blinks twice and then disappears. We have chosen suggestive icons for each action of the remote user (see Fig. 9, from left to right: adding, selecting and deleting virtual objects, pausing the video stream and playing it again).

Fig. 9 Icons used for visual notifications

The Experiment

At the beginning of the experiment, each local user spent about 10 min being briefed on the rules of the game, the conditions under which the game would be played, the type of augmentation that could be provided by the remote expert and the notifications generated by the system.

Then the local user solved the puzzle 6 times, i.e., under 3 conditions, each for 2 levels of difficulty (7 and 10). The exact same 6 configurations of the puzzle, chosen beforehand, were solved by all 12 participants, who were asked to find each solution in approximately 5–7 min.

Each participant needed between 75 and 90 min to do the experiment. Solving the puzzle in all conditions took about 40 min. The rest of the time was spent on filling in the questionnaires and on a debriefing session at the end to collect additional observations and suggestions that were not included in the responses of the questionnaires.

The local user wore an optical see-through HMD and sat in front of a table on which the pieces of the puzzle were spread (see Fig. 10). The remote user sat in a separate room in front of a laptop, with the solutions for all 6 configurations of the puzzle, found using a Katamino solver application,Footnote 5 written down on paper. The instructions were provided using only visual communication between the users.

Fig. 10 Pictures taken during the experiments: (left) the remote user and (right) the local user

Each round started by removing the pieces that were not needed to build a certain solution (known only to the remote user). In this stage, the remote player marked each piece that the local user had to put aside with the virtual “reject” label from Fig. 7 (5 pieces to be removed for level 7 and 2 pieces for level 10). After that, the local user received information on the correct position of 2 pieces for level 7 (3 pieces for level 10). A piece was indicated using the virtual “approval” label from Fig. 7, and its position on the board game was given as a text message listing the squares that should be covered by that piece (e.g., “A1 B3”). After that, the local user was supposed to solve the puzzle alone. Since the solution had to be found within 5–7 min, instructions continued to be provided as long as the remote user considered this necessary.

After each condition, the local users were asked to answer the questions in Questionnaire 1. After performing under all conditions, they filled in Questionnaire 2 twice, once for each level of difficulty.

Results

Results from the Questionnaires

We analyzed the Likert scale responses (Q1.1–Q1.16) based on two factors, namely the game level (L1: level 7 and L2: level 10) and the condition (No, Audio, Visual).

In order to see whether the participants perceived the two game levels L1 and L2 differently in terms of task load, we ran the parametric paired-sample t test for the pairs (L1, L2) under each condition for the NASA-TLX items (Q1.11–Q1.16). The values in Table 3 indicate that for most of the comparisons, the null hypothesis was not rejected at a significance level of \(\alpha =.05\) (bold values are the p values lower than \(\alpha\)). This means that in these situations there are no statistically significant differences between the two game levels L1 and L2. For Q1.11, however, the paired t test rejected the null hypothesis under the No and Audio conditions, while under the Visual condition the p value is above, but very close to, the threshold value of 0.05. Comparing the mean values in Table 4, we conclude that the users perceived the task in game level L2 as more mentally demanding than the task in game level L1.

Table 3 The p values for the paired sample t test for the pairs (L1, L2) under each condition
Table 4 Mean/SD values for Q1.11 under each level (L1, L2) and under each condition

The Likert scale ratings between the conditions were checked using the Friedman test (\(\alpha =.05\)). For the situations in which the null hypothesis was rejected (i.e., there were statistically significant differences between the three conditions), a pair-wise Wilcoxon signed-rank test was applied. All these results were obtained using the MATLAB toolbox for statistics.
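An equivalent analysis pipeline can be sketched in Python with SciPy; the data layout and function names below are illustrative assumptions and do not reproduce our MATLAB scripts.

```python
from scipy import stats

ALPHA = 0.05

def compare_levels(tlx_l1, tlx_l2):
    """Paired-sample t test between game levels L1 and L2 for one NASA-TLX item."""
    return stats.ttest_rel(tlx_l1, tlx_l2)

def compare_conditions(q_no, q_audio, q_visual):
    """Friedman test across the three conditions for one question; if significant,
    follow up with pair-wise Wilcoxon signed-rank tests."""
    _, p = stats.friedmanchisquare(q_no, q_audio, q_visual)
    posthoc = {}
    if p < ALPHA:
        posthoc["audio_vs_no"] = stats.wilcoxon(q_audio, q_no)
        posthoc["visual_vs_audio"] = stats.wilcoxon(q_visual, q_audio)
        posthoc["visual_vs_no"] = stats.wilcoxon(q_visual, q_no)
    return p, posthoc
```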

The values in Table 5 result from applying the Friedman test. For game level L1, the questions whose Likert ratings differ significantly between conditions are Q1.4, Q1.5, Q1.6, Q1.7, Q1.10 and Q1.16; for game level L2, they are Q1.4, Q1.6, Q1.7 and Q1.16.

Table 5 Friedman test (\(\alpha =0.05\))

We then performed pair-wise comparisons between the conditions. The results are displayed in Figs. 11 and 12 and can be summarized as follows:

Fig. 11 Results of Wilcoxon signed-rank test for game level L1

Fig. 12 Results of Wilcoxon signed-rank test for game level L2

Audio Versus No

Audio helped the local users to be more aware of when the remote user was performing an action (Q1.4) in L1 (\(Z=2.41, p=0.016, Md=4.5\) vs. 4) and L2 (\(Z=2.69, p=0.007, Md=4\) vs. 3); Audio distracted the local users more (Q1.6) in both L1 (\(Z=2.44, p=0.014, Md=3.5\) vs. 1) and L2 (\(Z=2.46, p=0.014, Md=2\) vs. 1); Audio caused the local users to feel more overloaded with notifications about the remote user’s activity (Q1.7) in both L1 (\(Z=2.74, p=0.006, Md=4\) vs. 1) and L2 (\(Z=2.69, p=0.007, Md=2.5\) vs. 1); and Audio made the local users feel less successful in accomplishing the task in L2 (Q1.16: \(Z=-2.16, p=0.031, Md=2.5\) vs. 4).

Visual Versus Audio

Local users could follow the remote user’s instructions (Q1.5) better with Visual (L1: \(Z=2.33, p=0.019, Md=4.5\) vs. 4); Visual also caused less distraction with information about the remote user’s activity (Q1.6) (L1: \(Z=-2.28, p=0.023, Md=2\) vs. 3.5) and less information overload about the remote user’s activity (Q1.7) in both L1 and L2 (L1: \(Z=-1.98, p=0.048, Md=2\) vs. 4; L2: \(Z=-2.14, p=0.033, Md=2\) vs. 2.5).

Visual Versus No

Visual helped the local users to be more aware of when the remote user was performing an action (Q1.4) (L2: \(Z=2.36, p=0.018, Md=4\) vs. 3); visual notifications caused more distraction (Q1.6) (L2: \(Z=2.27, p=0.023, Md=2\) vs. 1) and more overload for the local user (Q1.7) in both L1 (\(Z=1.99, p=0.046, Md=2\) vs. 1) and L2 (\(Z=2.33, p=0.019, Md=2\) vs. 1); with visual notifications, the local users felt less dependent on the remote user in completing the tasks (Q1.10) (L1: \(Z=-2.71, p=0.007, Md=3\) vs. 4); and visual notifications made the local users feel more successful in accomplishing the tasks (Q1.16) in both L1 (\(Z=2.35, p=0.019, Md=5\) vs. 3) and L2 (\(Z=2.23, p=0.026, Md=4\) vs. 2.5).

In conclusion, the analysis of the Likert responses shows strong statistical evidence that visual notifications cause less overload and less distraction than audio notifications. Both visual and audio notifications are more successful than no notifications in making the local user aware of the remote user’s activities and in supporting task accomplishment. Alongside the statistics above, Fig. 13 strengthens the position of visual notifications as the most preferred condition for providing workspace awareness.

Fig. 13 Most preferred (left) and least preferred (right) condition for game levels (L1, L2), according to the subjective responses given to Q2.1

Results from Discussions

From the debriefings, we received positive feedback on the overall performance of our AR environment, but we also got some interesting suggestions for future developments.

Most of the participants considered the visual notifications, as currently implemented, to be the best way to be informed about the remote user’s actions, compared to audio or no notifications.

Some participants mentioned that certain notifications (such as those for selecting and deleting an object or for freezing/playing the live stream) were not useful to them in completing the task. An interesting idea we received was to implement a customizable notification system, in which the local user could choose the actions to be notified about.

In some cases, the participants complained about the auditory clutter caused by overlapping audio notifications. This happened when the remote user performed many actions in a short amount of time and an audio notification was generated before the previous one had finished. At the same time, they admitted that speech notifications are necessary if they have to distinguish between many actions of the remote user. For a maximum of two or three actions, non-speech audio would be preferred over spoken words.

Another suggestion was that the visual notification for freezing the image should be displayed the whole time, until the remote user resumes the live stream. Three of the participants said that in a small workspace no notifications are needed, but at the same time they admitted the benefit of notifications in a larger environment. One participant said that it would be very helpful to have visual notifications indicating the position where an object was added (e.g., using arrows for 4–8 directions).

Another idea was to use audio only in a few cases considered more important, with visual notifications being a better option for the rest; alternatively, visual notifications could be used for all of the remote user’s actions, and for important actions (which could be defined in a priority list) the visual icons could be accompanied by a non-speech audio signal.

Critical Observations

Limitations of the User Study

Our user study was conducted in a “controlled” environment, with no serious consequences for participants who did not manage to complete the tasks successfully. Although we consider our findings to be valid in many real-life situations, it remains an open question how difficult circumstances may affect a person’s experience when using our system.

Limitations of the Current Setup

In our experiment, the local user was sitting in front of a table and had to solve a 2D assembly puzzle, a task that required short-range alignment between the virtual and the real world. As the real objects in focus are close to the local user’s eyes and to the RGB camera, head movements induce a large variation in the position of the area of the RGB image that matches the display area in the HMD view of the local user. Through many empirical trials in which we altered the position and the dimensions of this area, we managed to choose a fixed area that worked well for the majority of the participants in our experiment (see the transparent rectangle in the middle of the image in Fig. 8). However, the chosen fixed area did not support highly accurate alignment of virtual and real objects. For instance, we were not able to precisely place virtual annotations on real objects that were small and close to one another (e.g., keys on the keyboard of a laptop or squares on the board game in our experiment). This limitation led us to use text messages instead of visual icons to indicate squares on the board game.

Conclusion and Future Work

A quick and adequate exchange of visual context-related information to establish a common ground is necessary in order to make proper decisions and to avoid costly mistakes that cannot easily be undone. AR systems have successfully been used to establish such a common ground via virtual colocation. User studies showed that the workspace awareness of a local user needs to be improved during virtual colocation. For that purpose, we explored in this article how to increase the workspace awareness of a local user who is connected to a remote user. The remote user provides instructions to solve puzzle tasks by using an AR system for virtual colocation. We implemented automatic audio as well as visual notifications that are generated whenever the remote user interacts with the system. Each notification is sent to the local user to inform them of the action that has just been taken by the remote user.

We reported on a user study exploring the impact of audio and visual notifications about the remote user’s actions on the workspace awareness of the local user. We used a game as a research instrument in order to set up a valid, repeatable and observable experiment. Although a well-grounded method in game research is still lacking [1], and requirements for games as research instruments are not yet well defined [33], the game we used allowed us to study virtual colocation in a spatial task. In future research, we would also use different types of games in order to better understand and define the role of distinct game elements in the research process.

The results of our study show that the local users prefer visual notifications over audio and no notifications. It is interesting to see that the visual notifications cause less overload for the participants. This could be explained by the fact that the task in the game we used already requires visual attention, and an audio signal would mean that participants have to divide their attention between two cues (audio and visual) instead of staying focused on only one (visual). For AR systems, this suggests that limiting cues to one modality benefits the user. Future research should consolidate this aspect by including further experiments with different modes of notifications. We also consider extending the area of awareness for the local user by adding the “where” category as described by [25], i.e., providing information on the location of the remote user’s activities. For that purpose, we will make use of an inertial measurement unit (IMU) mounted on the HMD of the local user to determine the current position of the local user relative to the previous position at which the remote user “froze” the image in order to perform an activity.