Pointing in the third-person: an exploration of human motion and visual pointing aids for 3D virtual mirrors

A “virtual mirror” is a promising interface for virtual or augmented reality applications in which users benefit from seeing themselves within the environment, such as serious games for rehabilitation exercise or biological education. While there is extensive work analyzing pointing and providing assistance for first-person perspectives, mirrored third-person perspectives have received minimal consideration, limiting the quality of user interactions in current virtual mirror applications. We address this gap with two user studies aimed at understanding pointing motions with a mirror view and assessing visual cues that assist pointing. An initial two-phase preliminary study had users tune and test nine different visual aids. This was followed by in-depth testing of the best four of those visual aids compared with unaided pointing. Results give insight into both aided and unaided pointing with this mirrored third-person view and allow comparison of the visual cues. We note that users consistently pointed far in front of targets when first introduced to the pointing task, but that unaided motion improved after practice with visual aids. We found that stereoscopy alone is not sufficient to enhance accuracy, supporting the use of the other visual cues we developed. We show that users point differently when pointing behind versus in front of themselves. We finally suggest which visual aids are most promising for 3D pointing in virtual mirror interfaces.


Introduction
Providing a user with a 3D mirrored view of themselves (such as in Fig. 1) in a virtual environment can create a useful interface when the more common first-person perspective is not sufficient. This view may be preferred if the application requires the user to be fully aware of what their body is doing, be aware of what is behind them, or be able to see themselves with some augmented image. Such interfaces appeared in seminal early VR systems that displayed a mirrored silhouette or video of the user that could interact with its surroundings (Vincent 1993; Krueger 1985). More recently, these interfaces can be found in games that use a Microsoft Kinect to drive an avatar, applications that help users produce specific body movements (Hülsmann et al. 2019), and certain applications for education (Woodworth and Borst 2017; Malinverni et al. 2017; Bork et al. 2017). As such applications grow, we expect that developers will want users to be able to point accurately in "virtual mirror" environments. For example, movements for rehabilitation or movement-training tasks in a virtual mirror may benefit from precise targeting cues.
Pointing in virtual reality environments is a basic interaction method often needed for selection or communication (Poupyrev et al. 1998). As such, pointing has been heavily explored in the context of the common first-person VR perspective (Argelaguet and Andujar 2013), and many techniques and visual aids have been developed to enhance pointing accuracy. However, to our knowledge no major works have addressed these topics for a third-person perspective in which the user sees themselves from an exocentric view. A mirror view naturally uses a third-person perspective (Preston et al. 2015), giving us reason to believe first-person pointing techniques may not be applicable. Some works assist pointing with an exocentric main view (Stoakley et al. 1995; Mine et al. 1997), which may initially seem related to pointing in a mirror. However, these solutions do not deviate from the first-person perspective in terms of pointing; they only expand the portion of the virtual world the user can see (aspects like the interaction tool are still viewed and manipulated in first-person).
There are notable differences between pointing in first- and third-person perspectives that warrant investigating differences in visual aids. One key difference is that the pointing origin and direction are seen in a more indirect and disconnected way that may limit a user's depth awareness. Targets are also naturally seen from a farther egocentric distance, increasing the chance of misjudged target position (Lin and Woldegiorgis 2017). Another difference is that a user may need to point at targets to their side or behind them without the ability to rotate to adjust their view. This creates three main regions (front, back, and side) in which we expect arm motion to differ.
As example motivation we look to the growth of cross-reality applications (Pazhayedath et al. 2021). Because VR is popular yet not ubiquitous, developers of VR meeting spaces (e.g., Virbela, Mozilla Hubs, VRChat) support conventional desktop interaction in addition to their VR interfaces. Recently, some have used these tools to supplement education (Yoshimura and Borst 2021) and academic conferences (Ahn et al. 2021). However, in these works authors have found that even users with access to VR equipment may prefer non-immersed alternatives for longevity, comfort, and increased access to resources. This has prompted the design of new alternative cross-reality interfaces (e.g., Woodworth et al. (2022)), which we consider to warrant further investigation.
In one case, Borst et al. (2018) used a virtual mirror interface to help teachers guide a virtual field trip. The teacher was given a mirrored view (such as in Fig. 1), and their 3D image was captured by a Kinect and streamed to headset-immersed students who viewed it as a 3D avatar in the environment. The authors noted that teachers were initially unable to point at the correct depth when describing objects in the environment, and were not even aware of their inaccuracy due to ambiguity in the pointing direction. This problem motivates the design of visual aids to assist pointing accuracy.
This paper presents studies to evaluate these visual aids, as well as to gain insight into third-person pointing in general. First, a preliminary study presented subjects with nine different visual aids and was designed primarily to fine-tune the individual techniques and eliminate those that substantially underperformed in terms of time and accuracy. Results from this stage had limited statistical power (10 subjects) and could not definitively show which visual aid, if any, was best. Additionally, it did not investigate unaided pointing and did not fully consider stereoscopy, the inclusion of which may reveal more about how people point in a mirrored third-person view.
A subsequent main study, with 33 subjects, provides a more thorough investigation and addresses the preliminary study's limitations through detailed evaluations of the visual aids loosely identified as superior in the preliminary study, and through closer evaluation of subject motion during pointing. In the study, subjects performed a task similar to a standard Fitts' law experiment, using visual aids to point at targets and clicking to select when they believed their pointing was accurate. We first present an introduction phase, in which subjects performed a pointing task unaided and with minimal instruction, to provide insight into initial pointing problems. Then, in the main comparison phase, subjects repeated the tasks with the visual aids to determine how they affect pointing motion. The comparison phase also presented the baseline unaided task (in random order between the visual aids) to determine if subjects behaved the same after "training". In each phase, the visual aids alternated being shown with and without stereoscopic cues to determine if stereoscopy affects pointing accuracy or ease.
Our contributions include:

• The design and analysis of several visual aids meant to increase pointing accuracy and confidence.
• Insight into the differences between pointing in first- and third-person perspectives.
• A demonstration of how stereoscopic depth cues do not inherently increase pointing accuracy.
• A demonstration of the difference between aided and unaided pointing motion over time.
• An analysis of pointing motion to targets behind and in front of the user's avatar.

Related works
Our research surrounds two major research domains: virtual mirrors and pointing in VR. Pointing in 3D environments for selection or communication is a fundamental task that has been studied extensively (Argelaguet and Andujar 2013).
Research suggests that it is essential to give users some form of visual aid or feedback when pointing, as humans are naturally able to point more quickly and accurately in the real world compared to a virtual environment (Liu et al. 2009; Barrera Machuca and Stuerzlinger 2019). As such, we review several popular and recent works in aided 3D pointing, and establish a need for new work considering the virtual mirror's perspective and design. We review literature surrounding virtual mirrors to show common applications that use the interface and establish the need for pointing in this type of interface.

Virtual mirrors
Virtual mirrors have their origins in applications such as the 2D "Mirror Worlds" (Vincent 1993) and Videoplace (Krueger 1985), which used 2D video techniques to make the user's silhouette interact with the space around them. These applications typically allowed the user to overlap their silhouette with a virtual object to select or interact with it. For example, one application allowed a user to play a virtual drum kit by hitting their hands' silhouettes against the virtual drums. In another, users interacted with each other using 2D pointing with their silhouettes. Similar techniques were used in the commercial PlayStation EyeToy games, which often superimposed game imagery on top of the camera's image. We consider this to be a 2D analogue to the virtual hand interaction technique common in early 3D VR applications. More recently, virtual mirrors have been popular in virtual try-on applications, which overlay images of clothing onto the user to let them "try on" other outfits (Eisert et al. 2008; Hauswiesner et al. 2013). These applications are limited in their user interaction, primarily using an external touchscreen (Hilsmann and Eisert 2009) or very basic gestures (Straka et al. 2011) for selection.
Certain educational applications have also used a virtual mirror interface. For example, Blum et al. (2012) created an augmented "magic mirror" interface that drew images of organs and bones on top of users to teach anatomy. They developed a type of virtual hand metaphor for interaction, allowing the user to perform multi-hand gestures, such as zooming, by defining an interaction plane at a certain depth and blurring the user's hand until it reached the plane. A similar application by Bork et al. (2017) showed that not reversing the mirror image improved users' learning, though interaction was handled exclusively through an external input device. Hülsmann et al. (2019) created a virtual mirror in a CAVE in which users learn proper squat form through superimposed imagery of a skilled performance, though no selection or interaction was required from the user. Malinverni et al. (2017) created a serious game in which children with autism improved social skills by interacting with other users in the virtual mirror. The game included a stage in which children had to point at an enemy creature using a laser metaphor to defeat it, though the pointing was only done in 2D.
Certain other interface techniques that use a mirror have been proposed. For example, Thanyadit and Pong (2017) proposed giving the user a mirrored view of their hands when interacting with desktop VR, as it enabled more accurate physical interaction between the hand and virtual objects. We consider that these applications and others could be substantially further improved with more natural and immersive interaction that allows them to select objects within the 3D virtual world.
Mirrors have also been used inside virtual worlds to evaluate the sense of body ownership and agency in virtual reality and their effects. For example, Peck et al. (2013) experimented with reducing implicit racial bias by putting users in an avatar with a different race. The user was given a mirror in the virtual world to allow them to see themselves controlling the avatar. Similarly, Banakou and Slater (2017) had users view their avatar in a mirror while receiving tactile feedback as a virtual ball bounced across their body. The avatar was given a voice with a different fundamental frequency than the user's, with the intent being to see if the user's own fundamental frequency would change after associating with the avatar. Preston et al. (2015) evaluated body ownership in mirrored third-person views compared to standard third-person, finding that ownership in a mirrored view was similar to that of a first-person view.

3D pointing
The domain of assisted 3D pointing and interaction can be split into providing exocentric and egocentric views (Poupyrev et al. 1998), with egocentric being substantially better understood. Egocentric views are typically defined as giving a first-person perspective, as is most common in Virtual Reality applications. Many interaction types using this view can be further broken down into virtual hand metaphors and virtual ray metaphors.

Virtual hands
Virtual hand metaphors are typically the simplest to use and understand, as they exploit natural proprioception. Virtual mirror applications employing natural interaction typically use a variant of this metaphor. A standard virtual hand metaphor only requires the user to move their hand to the desired object to select or grasp it (Prachyabrued and Borst 2014). However, this limits the interaction range to only the area the user is able to naturally move their hands in. Several works have attempted to address this problem by altering the control-display ratio (e.g., the go-go (Poupyrev et al. 1996) and stretch go-go (Bowman and Hodges 1997) techniques), but this approach forfeits the afforded natural proprioceptive cues and decreases accuracy for long distances (Argelaguet and Andujar 2013).
We consider that users of a virtual mirror system will be unwilling or unable to move large distances in the real world due to limited camera views or space, making a virtual hand's limited range problematic. Additionally, we expect that modification of the control-display ratio would be confusing for the user, considering they can also see their real hand, and inappropriate for communication. Thus we consider these techniques unsuitable for virtual mirror applications that give the user a wide field of view.

Virtual rays
Virtual ray metaphors are the dominant form of selection metaphor and address the limited range provided by virtual hands. These metaphors are also more conceptually similar to pointing. Techniques typically project an object in the pointing direction, and an object that is intersected is considered to be selectable. The object typically takes the form of a line (Bolt 1980) or a cone (Liang and Green 1994).
Virtual rays can suffer from precision problems and the need to disambiguate when multiple targets are intersected. Lines require more precision than cones due to their smaller volume, but only require disambiguation when objects occlude one another. This can be mitigated by allowing the user to select among targets along the line when it intersects more than one (e.g., the Depth Ray (Grossman and Balakrishnan 2006)). Cones have greater volume, increasing consistency of precision at different distances but potentially intersecting more targets, requiring better disambiguation techniques. This disambiguation can be done automatically (e.g., Flashlight (Liang and Green 1994)) or manually (e.g., PRECIOUS (Mendes et al. 2017)).
We expect virtual ray metaphors to provide better usability compared to virtual hands when in a virtual mirror with a wide field of view, as they allow the user to interact more easily with distant objects. However, no previous research has studied their effects when used in third-person, so it is unknown what virtual ray techniques are best for such an interface.

Exocentric metaphors
An exocentric view is one in which users see and interact with a third-person perspective of the virtual environment (Poupyrev et al. 1998). Selection metaphors using this type of view would initially seem more closely related to the problem of pointing in a virtual mirror. However, we note that these techniques often only make the world more accessible to a first-person view, rather than offering a third-person view of the user. For example, the classic World-In-Miniature technique (Stoakley et al. 1995) presents the user with a third-person overview of the environment, but the user still selects objects with their hand seen from a first-person perspective. A similar technique by Mine et al. (1997) allows the user to scale the world to bring objects closer so they can be selected in first-person. The more recent Kinespheres technique (Lubos et al. 2016) aligns multiple targets on a sphere within the user's arm's reach for selection in a first-person perspective.
This can be contrasted to pointing in a virtual mirror, in which the actual selection tool and user avatar are seen from a third-person perspective, and users may have to point at targets in front of and behind themselves. Because of this, we establish that 3D pointing solutions have not been adequately studied for a mirrored third-person view of the avatar. We take inspiration primarily from virtual ray metaphors, as we expect the user to need to point at distant targets.

Description of visual aids
Our visual aids were primarily developed to give users a better sense of the depth at which they were pointing. During development of a virtual mirror interface for remote teaching (Ekong et al. 2016), it became clear that users were able to naturally find a target's elevation, thus making the pointing appear correct on their 2D monitor. However, they often did not notice how deep into the scene a target was, leading them to primarily point just to the side and miss targets that were behind or in front of them. Correct depth targeting was considered critical in a shared VR space with other users observing the pointing from varied viewpoints.
In developing the visual aids, we considered two main pointing scenarios based on the availability of information about the targets to the system. When the environment presents a minimal number of possible targets, the system may know the exact location of a few potential pointing targets. In this scenario, visual feedback can be given based on the angular error between the user's pointing direction and the target. When pointing for communication, for example, in a collaborative VR application, the system may not know which objects are potential objects of interest, or there may be too many to display pointing aids for all. In this scenario, a visual aid would need to provide a more general sense of accuracy. We suspect that visual aids that are based on known targets will allow the user to point more quickly and accurately, as they will be able to convey more information about the user's pointing.
All visual aids operate with the input of a determined pointing direction. In order to develop an interface that could work without a hand-held input device, such as a tracked wand, this pointing direction is determined by information given from a Kinect device. The Kinect is able to report the position of all major "joints" of the user's body. From this, we define the pointing direction as a ray from one joint to another. In our experiments, this ray is defined to be between the user's elbow and palm. While other joints were considered, this configuration was chosen for the stability of the reported position data. Using the ray from the palm to the finger may, in theory, provide a direction more in line with how the user thinks they are pointing, but the reported positional data was too unstable for practical use.
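The elbow-to-palm ray described above can be sketched as follows (a minimal Python illustration; the function names are ours, not the Kinect SDK's API):

```python
import math

def normalize(v):
    """Scale a 3D vector to unit length."""
    mag = math.sqrt(sum(c * c for c in v))
    return tuple(c / mag for c in v)

def pointing_ray(elbow, palm):
    """Pointing ray from two tracked joints: origin at the elbow,
    direction toward the palm. Both arguments are (x, y, z) positions."""
    direction = tuple(p - e for p, e in zip(palm, elbow))
    return elbow, normalize(direction)
```

Each visual aid then consumes this (origin, direction) pair, regardless of which joints produced it.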
In total, nine aids (seen in Fig. 2) were used in the preliminary study, with the intent of narrowing down to the best for a larger study. In the preliminary study, users were asked to adjust certain parameters for each visual aid (seen in Table 1 and explained below), with each user performing a test with their personal preferences. Four visual aids (seen in Fig. 3) were deemed superior in terms of accuracy, time, and subjective user rating, and were brought into the larger main study. We provide details for each visual aid below; an analysis of how users adjusted them during the preliminary study's adjustment phase appears in the following section.

Target rods (TR)
When the user is pointing, a translucent cylinder (rod) is extended along the vector from the target to the virtual elbow position. Depth information is reinforced through the rod's color and the scaling of the rod's ends, both driven by the azimuth (lateral) angle between the pointing direction and the direction to the target. A green rod signifies accurate pointing. Users are intended to move their arm until it is visibly "inside" the rod, making it appear more like an extension of the arm.
The placement parameter defined whether the rod extended its full length from the user's elbow to the target, was half length and attached to the elbow, or was half length and attached to the target. The scale style parameter defined whether the cylinder end facing the user expanded or shrank.
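The color and end-scaling feedback can be sketched as follows (a Python sketch; the tolerance, scaling cap, and sign convention are our assumptions, not the study's values):

```python
def rod_feedback(azimuth_error_deg, tolerance_deg=2.0):
    """Map the azimuth error between the pointing direction and the direction
    to the target onto the rod's color and end scales (near_end, far_end).

    Positive error is taken to mean pointing in front of the target
    (assumed sign convention)."""
    if abs(azimuth_error_deg) <= tolerance_deg:
        return "green", 1.0, 1.0                    # accurate: both ends roughly equal
    scale = 1.0 + min(abs(azimuth_error_deg) / 30.0, 1.0)  # grow with error, capped
    if azimuth_error_deg > 0:
        return "red", 1.0 / scale, 1.0              # in front: shrink the near end
    return "red", 1.0, scale                        # behind: expand the far end
```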

Target map (TM)
A top-down camera placed high above the user's elbow renders known targets onto a small visible quad hovering near the user's mirrored avatar. A line segment is drawn from the quad's center to represent the current pointing azimuth, so the line intersects a target's representation when the user is pointing at the correct depth. The appearance resembles a circular gauge or dial indicator.
The placement parameter defined whether the map hovered near the user's head, shoulder, or moving pointing hand. The size parameter defined the length of each side of the quad.
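The map's geometry reduces to a top-down projection; a minimal Python sketch (the helper names are ours, and the real system renders with a camera rather than computing map coordinates explicitly):

```python
import math

def topdown(point):
    """Project a 3D point (x, y, z) onto the map plane, dropping height."""
    x, _, z = point
    return (x, z)

def azimuth_line(elbow, direction, length=3.0):
    """Endpoints of the map's pointing line: it starts at the elbow's
    projection and extends along the horizontal (azimuth) component of the
    pointing direction, crossing a target drawn at the correct depth."""
    ex, ez = topdown(elbow)
    dx, _, dz = direction
    mag = math.hypot(dx, dz)  # ignore the vertical component
    return (ex, ez), (ex + length * dx / mag, ez + length * dz / mag)
```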

Sphere cutout (SC)
A mostly-invisible sphere is centered on the elbow with a radius matching the distance to the target. A section of its surface is made visible to show the user's current error in pointing. Specifically, two points on the sphere are defined: the point at which the target lies and the point where the user's pointing direction intersects the sphere. A shape is drawn between these two coordinates; the user is intended to minimize the shape for more accurate pointing. The effect is an angular analog to the rubber band selection tool common in 2D interfaces.
The shape parameter defined whether the shape rendered between the two points was rectangular or circular.
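The cutout's two anchor points can be computed as follows (an illustrative Python fragment under our reading of the description; the function name is ours):

```python
import math

def sphere_cutout_points(elbow, direction, target):
    """Two points on the feedback sphere, which is centered on the elbow
    with radius equal to the elbow-target distance: the point where the
    pointing ray meets the surface, and the target itself, which lies on
    the surface by construction."""
    radius = math.dist(elbow, target)
    mag = math.sqrt(sum(d * d for d in direction))
    unit = tuple(d / mag for d in direction)
    ray_point = tuple(e + radius * u for e, u in zip(elbow, unit))
    return ray_point, target
```

The visible shape is drawn between the two returned points, so shrinking it to nothing corresponds to zero angular error.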

Cone compass (CC)
Similar to a 3D arrow guidance cue (Burigat and Chittaro 2007), a small cone glyph points toward the target or hand. Its color changes to reflect pointing accuracy, with green signifying accurate pointing.
The placement parameter defined whether the cone was placed near the hand or the target. The style parameter defined whether the cone acted as a compass, pointing directly at the target or the hand (whichever was opposite its placement), or as a simple glyph pointing up or down according to pointing depth error.

Fig. 3 A Translucent rods that guide the user toward the target. From left to right: the user points behind the target, turning the rod red and expanding the farthest end; the user points at the correct depth, turning the rod green and making both ends roughly equal; the user points in front of the target, turning the rod red and shrinking the nearest end. B A "rectangular" section of a sphere is drawn to show the user's pointing error. C A thin translucent fan is attached to the user's hand, with different colors on the two sides of the fan to help indicate pointing height. D A thin wand extends out of the user's hand; its size adjusts to the nearest intersection point

Pointing wand (PW)
A ray-like technique: a thin cylinder with a striped texture is attached to the hand. The wand acts as an extension of the arm, pointing in the user's pointing direction, and users can determine accurate pointing when it intersects an object.
The length parameter defined whether the wand had a static length (long enough to reach past the farthest target) or changed length to exactly touch the closest pointing intersection point. The material parameter defined whether the wand was transparent or opaque.

Pointing fan (PF)
A translucent vertical fan is attached to the hand. A thin cylinder extends out from the user's hand in the pointing direction, similar to the wand, and two fins are extended from its top and bottom. Objects behind the fan are at least partially visually occluded, giving the user a sense of their pointing depth.
The length parameter defined how far the fan object extended from the hand. The material parameter defined whether pixels on the fan turned white when close to another object or were rendered with a static transparency and color.

Hand camera (HC)
A virtual camera is placed on the user's pointing hand, with the frustum aimed in the pointing direction. The camera's image is textured onto a visible quad hovering near the mirrored avatar. A light reticle is also drawn on the image to emphasize its center, where targets appear when pointing is accurate.
Given the hand camera's rendering style, similar to the target map's, the two aids shared placement and size parameters. The zoom parameter controlled the camera's field of view.

Hand light (HL)
Multiple hand-mounted lights are arranged to project a red and white bullseye reticle pattern, which appears on any objects pointed to. The size parameter controlled the projection angle of the bullseye light, and the intensity parameter controlled the light's intensity (brightness).

Pointing slits (PS)
Thin slits of light are projected out from the user's hand in the pointing direction. A thin vertical slit can be seen along the ground and shows the pointing depth.
The width parameter controlled the projection angle of the slits, making them wider or thinner. Similar to hand light, intensity controlled light brightness. The configuration parameter defined additional rendering options: horizontal slit, centered on the hand's pointing direction to show pointing elevation, and a bullseye, placed at the intersection of both slits.

Preliminary study
We conducted a two-phase user study to determine good configurations for each of the nine visual aids and then compare the techniques. Subjects adjusted each visual aid in one phase, and then performed a Fitts'-law-like test using their personal configurations.

Subjects and apparatus
Ten subjects (nine male, one female) participated. Subjects were screened to confirm normal or corrected vision and normal limb function. Nine were from a local computer science department. Four subjects had prior VR experience, eight had extensive video game experience, and all were right-handed.
Subjects stood approximately 2 m in front of a 190 cm Samsung 3D TV and a Microsoft Kinect. As illustrated in Fig. 2, the virtual environment (created with the Unity3D engine) contained seven spherical targets on each side of the subject. Targets were placed on pedestals of varying heights and at varying depths relative to where the subject was asked to stand.

Parameter adjustment phase
The first phase had subjects freely adjust certain parameters of the visual aids (listed in Table 1) while practicing pointing at targets in the environment. It consisted of nine rounds, one for each aid, with order generated using a Latin Square. The subject was given a wireless mouse to hold in the pointing hand for use as their interaction device.
One target was shown at a time, and the subject could freely change to other targets by left-clicking the mouse button. The subject could modify one parameter at a time by scrolling the mouse wheel. Once they reached the preferred value for that parameter, they notified the experimenter and were moved to the next parameter. Once all parameters were tuned, the subject could move to the next round or readjust any other parameters at will. The configurations chosen during this phase were used for the second phase of the experiment.

Performance and ranking phase
During the second phase, the subject was presented with 11 rounds of a Fitts'-law-like test. The nine visual aids as well as two baseline conditions (stereo 3D and an unaided monoscopic view) were presented in random order. Stereo 3D was not shown in tandem with the visual aids in order to focus specifically on the sense of depth given by each visual aid on its own. The effects of stereo 3D used in tandem with visual aids were later explored in the main study. Active shutter 3D glasses were given to subjects when entering the stereo round and collected when it was completed.
In each round, the subject pointed at and selected (left-clicked) targets as quickly as they could while still maintaining a sense of accuracy. Accuracy was not enforced, as it would not be enforced in a collaborative communication context, so time to click was left to the subject's judgment. A single round consisted of the subject selecting 14 targets (each target appeared twice in random order) 3 times, with a short rest between the three repetitions. Click time and error data were recorded.
At the end of each round, subjects were asked to score the condition from 1 to 10 based on how the visual aid (or lack thereof) allowed them to quickly and accurately find targets. Subjects were also asked to explain anything they liked or disliked about that condition.

Results
We present the user preferences from the adjustment phase and compare results of our visual aids and baselines from the ranking phase. For time, accuracy, and subjective measures, we base discussions on means, distributions, and visual inspection, as the large number of conditions and low number of subjects leads to limited statistical power, especially if familywise error is constrained by techniques like Bonferroni correction. We use these findings to justify selections of visual aids for the main study, in which we perform more rigorous statistical tests.

Adjustment phase preferences
Target Rods While there was no general consensus on preferred scale style, the majority (6) of subjects preferred for placement to fully extend the rod from elbow to target, as it enabled them to see when their arm was fully immersed in the rod while still seeing it reach the correct target.
Target Map A majority of subjects (8) preferred a static placement, but showed no preference between the shoulder and head. Preferred size spanned from 0.5 m to 1 m in width and height, with the mean preferred size being 0.77 m.
Sphere Cutout A majority (6) preferred the rectangle shape due to it showing error distinctly in both azimuth and elevation.
Cone Compass A majority of subjects (6) preferred the cone's placement near the hand, and 9 preferred the compass style, pointing it to the target.
Pointing Wand While there was no consensus on the wand's material, every participant preferred its length to dynamically extend to exactly the first intersection point, as it more clearly showed the first-hit target.
Pointing Fan A majority of subjects (9) preferred the material that indicated depth with whitened pixels, and subjects always chose a length that was as far as the farthest target in the scene or higher.
Hand Camera A majority of subjects (6) preferred the quad's placement on the shoulder. Preferred size ranged from 0.61 m to 1 m, with a mean of 0.88 m. Preferred zoom (field of view) ranged from 34 to 61 degrees, with a mean of 48 degrees.
Hand Light A majority of subjects (6) preferred the highest possible light intensity. The preferred size ranged from 11 to 41 degrees, with a mean of 19 degrees, indicating preference for a smaller, more precise bullseye.

Pointing Slits
While there was no consensus on configuration, a majority of subjects (7) preferred the added reticle. A majority (6) preferred light intensity to be as high as possible. The preferred width, determined by the light's field of view, ranged from 57 to 151 degrees, with a mean of 115 degrees.

Accuracy
The main purpose of the visual aids was to improve the users' accuracy in pointing depth and discourage simply pointing to the side. We measure the accuracy for each cue in terms of angular error between the pointing direction and the ideal direction to the target. Error is broken into azimuth (lateral) and elevation angles to facilitate analyzing pointing depth specifically. Angle error results are seen in Fig. 4.
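As a hedged illustration of this measure (the axis conventions are our assumption, not the authors' implementation), the decomposition of pointing error into signed azimuth and elevation angles could be computed as:

```python
import math

def pointing_error(pointing_dir, ideal_dir):
    """Signed azimuth/elevation error (degrees) between two 3D direction
    vectors (need not be normalized). Azimuth is measured in the
    horizontal x-z plane, elevation against it; the axis layout is an
    illustrative assumption."""
    def angles(v):
        x, y, z = v
        az = math.degrees(math.atan2(x, z))                  # horizontal angle
        el = math.degrees(math.atan2(y, math.hypot(x, z)))   # vertical angle
        return az, el
    paz, pel = angles(pointing_dir)
    iaz, iel = angles(ideal_dir)
    return paz - iaz, pel - iel

# Pointing straight ahead when the target direction is 45 degrees to the right:
az_err, el_err = pointing_error((0, 0, 1), (1, 0, 1))  # -> (-45.0, 0.0)
```

A negative azimuth error here means the pointing direction lags the target laterally; in the paper's convention, positive azimuth error corresponds to pointing in front of the target.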
For azimuth error, both baseline conditions (stereo 3D and monoscopic without visual aids) appear less accurate than the visual aids. Subjects seemed to point in front of the target when unaided. When using stereo 3D, accuracy may have improved somewhat, but not substantially, suggesting stereoscopy on its own was not enough to correct pointing. Most visual aids appear to have performed similarly well with the exception of hand light and pointing slits which may have resulted in pointing farther in front of the target.
Elevation error from baseline conditions appears less severe, but still suggests subjects pointed higher than with visual aids. Most visual aids appear to have resulted in similar accuracy except hand camera and pointing slits which show subjects pointed farther below the target.
More important than average accuracy is variance in accuracy. A high (far-from-zero) average error could be due to an error in our estimation of pointing direction and error calculation. A high variance, however, would indicate that users were unaware of where they were pointing regardless of our calculation method. Distributions in Fig. 5 show a wider spread in azimuth errors for the baseline conditions when compared with visual aids. This is consistent with users only performing coarse pointing without visual aid.
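The bias-versus-precision distinction above can be made concrete with made-up numbers: a constant estimation offset shifts the mean error but not the spread, so a large standard deviation reflects genuine imprecision regardless of the calculation method.

```python
import statistics

# Hypothetical azimuth-error samples (degrees); values are illustrative only.
aided   = [1.2, 0.8, 1.5, 1.1, 0.9]      # biased but tight
unaided = [12.0, -3.0, 25.0, 4.0, 18.0]  # biased and scattered

# A calibration offset in the error estimate could explain a nonzero mean...
bias_aided = statistics.mean(aided)          # 1.1
bias_unaided = statistics.mean(unaided)      # 11.2
# ...but not a large standard deviation, which indicates true imprecision.
spread_aided = statistics.stdev(aided)
spread_unaided = statistics.stdev(unaided)
```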
Notably, some visual aids provided good accuracy for one angular axis and not the other. For example, it appears the target map was accurate for azimuth, but had a high variance for elevation. We expect this was due to users not being concerned with finding the correct elevation when there was no related feedback. Similarly, the pointing slits suffered a lack of precision in elevation due to minimal or confusing elevation feedback. The hand camera also appears to have resulted in inaccurate elevation; based on verbal feedback, we infer the cause was subjects aiming to get the target within the camera's frustum rather than specifically in the center. Because of this, we consider these three techniques to be inappropriate for use.
(Figure caption: Rating scores given to each technique by subjects. Triangles denote means, pluses denote outliers, and the horizontal line denotes medians.)

Time
Time-to-click can be seen in Fig. 6. The monoscopic baseline appears to have required less time than all other conditions. Stereo 3D appears to have taken slightly more time, and is comparable to target map and sphere cutout. Subjects reported they felt able to click most quickly when completely unaided, but also reported no confidence in their accuracy. Responses were similar for stereo 3D. We infer again that an interface without pointing aids encouraged coarse pointing behaviors, even among subjects who had already practiced with aids that reinforce better behavior.
The cone compass appears to have taken substantially longer than all other conditions. Subjects reported difficulty knowing which direction to move to correct their pointing, due to the viewing perspective and the aid's conical shape. Because of this, we consider this technique inappropriate for use.

Subjective ratings
Subjective ratings are seen in Fig. 7. When compared to monoscopic and stereo 3D baselines, all visual aids except cone compass, hand light, and pointing slits appear to be rated better. Sphere cutout, target rods, pointing fan, and pointing wand appear to have among the highest median ratings and the lowest spread of scores (i.e., high minimums).
For practical and frequently used applications, we would want to use visual aids that not only have high ratings, but have few low ratings and little variance. For these reasons, we consider the hand light technique to be inappropriate for use.

Discussion
Based on the criteria of providing good selection time, high accuracy, and strong user preference, we identify the sphere cutout, target rods, pointing fan, and pointing wand as visual aids worth evaluating in a larger study. Identifying common traits among these, we see that all four involve an object that is extended out from the virtual body to the target. The sphere cutout and target rods also give clear indication of error while keeping the important visual part of the aid near the target.
In analyzing the visual aids that were deemed unsuitable, those that projected a 2D image onto the environment (hand light and pointing slits) were disliked because the projection warped users' perception of where the aid indicated they were pointing unless they were already on target, leading to a high time-to-click. This is in contrast to a first-person view, in which this projection warping is not as severe. Visual aids that encouraged looking at auxiliary information away from the target (hand camera and target map) were also disliked and inaccurate. Visual aids that provided feedback primarily for the azimuth axis (target map, and to a lesser extent hand camera and pointing slits) reinforce accuracy in that axis while allowing users to neglect the elevation axis. We can thus infer that the most important features of this type of visual feedback are: primary feedback visuals near the target, a feedback mechanism that includes some 3D component along the pointing vector, and feedback reinforcement in both azimuth and elevation.
These findings show that with well-designed visual aids, we can increase pointing accuracy with a small increase in targeting time, which we consider worthwhile due to the benefits of improved communication. Due to the wide variety of visual aids tested and results achieved for each aid, findings also hint at key design components to include or avoid for a successful visual aid.
However, results are somewhat limited due to aspects of the study design. A small number of subjects led to low statistical power and an inability to use corrected pairwise tests. Not all subjects used the same visual aids in the second phase, as each subject used their own preferred tuning. And analysis only considered metrics taken at target-selection time. Our main study, presented in the next section, addresses these concerns.

Main study
Following the preliminary study, we conducted a main two-phase within-subjects user study to achieve four main analyses.
1. To see how people behave in general when told to point at targets in a mirror without special techniques or visual aids, and then to measure how their unaided pointing improved after several rounds of practice.
2. To more deeply compare the best visual aids from the preliminary study to determine what, if any, differences there are between them. Techniques were compared in terms of accuracy, time taken to select a target, motion over time, and subjective user ratings.
3. To measure the effect of stereoscopy when pointing. We expect stereoscopy to make a difference, as it appeared to differ from the monoscopic baseline in the preliminary study. In this experiment, we compare stereoscopy combined with each visual aid, instead of only as a baseline as in the preliminary study.
4. To see if pointing behavior differs when users point in front of or behind themselves. We consider this an important factor, since this notion of different regions is a substantial factor when pointing from a third-person perspective.

Subjects and apparatus
33 subjects (30 male, 3 female) participated. All subjects confirmed they had normal or corrected vision and normal limb function. Mean subject age was 22.2 years, with a minimum age of 19 and maximum of 27. Subjects were primarily gathered from a local university, with 15 coming from the computer science department, 16 from mechanical engineering, and 2 being non-students. 16 indicated little to no experience with virtual reality, 11 indicated infrequent use, and 6 indicated frequent use, primarily in the context of gaming. 1 subject had participated in the preliminary study a year before; this subject's data was shown to not differ significantly from that of the rest of the subjects, so we decided to include it. Subjects stood in front of the same virtual mirror interface from the preliminary study, involving the 3D TV and Microsoft Kinect. Subjects again used a wireless mouse as a trigger, as in the performance phase of the preliminary study. Data collection in this experiment compensated for the slight jitter that clicking may introduce in the pointing hand by recording the full movement over time; the analysis also examines accuracy variance, which captures pointing precision regardless of such slight movements.

Environment
The environment was constructed entirely within the Unity engine. Spherical targets were placed on the right side of the subject on two 3 × 3 spherical grids, with each region (i.e., grid cell, section of the sphere surface) of the grid occupying a 60° × 60° space. The spherical grids were centered at the user's waist, with radii of 1 and 2 m, creating an inner and outer sphere. To avoid reliance on muscle memory, the targets were jittered from their starting position within each region. Each target was drawn atop a cylindrical pedestal, as other studies (Stuerzlinger and Teather 2014) have deemed it difficult to visually discern depth without such locational cues, and we are primarily interested in time taken to cognitively process and point to a depth, and not time spent understanding object location.
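A minimal sketch of this target layout, with angle origins, the "right side" orientation, and uniform jitter within each region all being illustrative assumptions rather than details from the study:

```python
import math, random

def generate_targets(radii=(1.0, 2.0), region_deg=60.0, seed=None):
    """Generate one jittered target per 60x60-degree region of a 3x3
    spherical grid, for each radius (inner and outer spheres). Positions
    are relative to the grid center (the user's waist in the study)."""
    rng = random.Random(seed)
    targets = []
    for r in radii:
        for i in range(3):        # azimuth regions
            for j in range(3):    # elevation regions
                az = i * region_deg + rng.uniform(0, region_deg)
                el = -1.5 * region_deg + j * region_deg + rng.uniform(0, region_deg)
                a, e = math.radians(az), math.radians(el)
                # Spherical-to-Cartesian conversion.
                x = r * math.cos(e) * math.sin(a)
                y = r * math.sin(e)
                z = r * math.cos(e) * math.cos(a)
                targets.append((x, y, z))
    return targets

targets = generate_targets(seed=42)   # 2 radii x 9 regions = 18 targets
```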

Methods
The experiment task was similar to the performance phase of the preliminary study. In summary, users were given a set of targets to point at with or without aid, and were asked to point at and click to select targets as quickly as possible while maintaining a sense of accuracy. Accuracy was not enforced so that error could be measured when the pointing was perceived to be accurate.
The study was divided into two phases: introduction, in which subjects experience pointing unaided for the first time, and comparison, in which they experienced the four visual aids and compared with an unaided baseline (order randomized by latin square). The two phases were broken into a number of rounds, 1 for introduction and 5 for comparison. Each round tested a single visual aid or an unaided baseline, and consisted of 4 stages with alternating stereo modes. The first stage's stereo mode was chosen randomly, and each following stage alternated the stereo mode between mono-and stereoscopic. Users were given time to rest and put on or take off the stereo glasses in between stages.
Each stage consisted of 9 tests. A single test began with the user resting their hand at their side, and followed these repeating steps:
1. A target appears.
2. The user points toward the target.
3. The user selects (left clicks) when they perceive that they are pointing at the target; the target disappears on click.
4. A small box is drawn near the user's right hip, indicating that they should bring their hand back to a resting position.
5. The user waits two seconds with the hand in a resting position, with each second being announced by a loud beep.
9 targets (corresponding to 9 tests) out of a generated set of 18 were shown to the user during each stage. At the start of the stage, one target is generated within each region (grid cell on a sphere) of the two 3 × 3 sphere grids. One target is then picked per region, randomly chosen from the corresponding targets on the outer and inner spheres. After each round of the comparison phase, subjects were asked to give brief comments about the visual aid or baseline and give a rating score from 1 (lowest) to 10 (highest). Subjects were told to rate based on how quickly and easily they were able to point to targets and find new targets during the round. To perform rating, subjects placed a physical image representing that condition along a row of numbers ranging from 1 to 10. If a subject's opinion on a visual aid changed during a later round, they were allowed to change any scores between rounds. After the experiment, subjects were given a brief exit questionnaire asking for general comments and which visual aid they preferred most.
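The per-stage sampling rule (one target per region, randomly drawn from the inner or outer sphere) can be sketched as follows; the data representation is our own and purely illustrative:

```python
import random

def pick_stage_targets(inner, outer, seed=None):
    """Given 9 inner-sphere and 9 outer-sphere targets (one per grid
    region, index-aligned), pick one target per region at random.
    A sketch of the sampling rule, not the study's actual code."""
    rng = random.Random(seed)
    return [rng.choice(pair) for pair in zip(inner, outer)]

# Placeholder targets tagged by sphere and region index.
inner = [("inner", i) for i in range(9)]
outer = [("outer", i) for i in range(9)]
shown = pick_stage_targets(inner, outer, seed=7)  # the 9 targets for one stage
```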
The result is a 6 × 2 within-subjects design. The condition specifies whether the subject used one of the four visual aids, the unaided baseline, or the unaided introduction. The stereo mode specifies mono- or stereoscopic rendering. Each combination was tested 18 times per subject, and each subject completed 216 tests total.
Measures primarily focused on time to click and pointing error at time of click. To also measure pointing motion over time, error was also recorded at each Kinect update step (30 Hz). Error was again measured in terms of azimuth and elevation.

Results and analysis
We present results in terms of pointing accuracy at selection time, variance of pointing accuracy, time taken to select, pointing motion over time, motion in different regions, and user ratings. For accuracy and time measures, we performed two-way ANOVAs to compare the effects and interactions of condition and stereo mode, and followed up with post-hoc Bonferroni-corrected pairwise t-tests when appropriate. To address possible violation of sphericity assumptions, degrees of freedom and p values were corrected using Greenhouse-Geisser (GG) estimates. For subjective ratings, we performed a Friedman test followed by pairwise Wilcoxon signed-rank tests between conditions.
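As context for the correction step, the Bonferroni adjustment simply multiplies each raw pairwise p value by the number of comparisons, capping at 1; a minimal helper (not the authors' analysis code):

```python
def bonferroni(p_values):
    """Bonferroni family-wise correction: multiply each raw p value by
    the number of comparisons, capping the result at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Example: 6 pairwise comparisons with made-up raw p values.
adjusted = bonferroni([0.001, 0.009, 0.02, 0.2, 0.5, 0.9])
```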

Accuracy
Signed angular error results taken at selection time can be seen in Fig. 8. A two-way ANOVA on azimuth error revealed a main effect for condition. Post-hoc comparisons again showed that the introduction and baseline conditions performed worse than all visual aids ( p < 0.001 for all). Introduction and baseline were not shown to differ ( p = 1.0 ). For elevation, target rods were shown to result in subjects pointing farther above the target than when using pointing wand ( p = 0.003 ), pointing fan ( p = 0.009 ), and sphere cutout ( p < 0.001 ), though the error was not necessarily "worse" (i.e., farther from 0). The other three were not shown to significantly differ from each other ( p = 1.0 for all).
In general, subjects overestimated elevation when using target rods, while they underestimated elevation using the other visual aids. We expect that this is because the target rods technique did not explicitly give feedback for elevation. This feedback was not given because it was assumed the users would be able to align their pointing arm within the rod to achieve accurate elevation. We saw that instead, subjects seemed to focus only on the explicit feedback given in terms of color and shape. This is in line with results from the preliminary study showing other visual aids without elevation feedback under-performing in accuracy. This is further supported by a high overestimation in the two unaided conditions.
(Fig. 8 caption: Mean azimuth and elevation errors at the time of target selection, with standard error bars. Results for conditions are separated by stereo mode. Azimuth error represents an error in the depth of subject pointing, with positive error meaning the subject pointed in front of the target. Elevation error represents height error, with positive error meaning the subject pointed above the target.)
(Fig. 9 caption: Histogram of azimuth error for all subjects. The top graph shows the distribution for the two unaided conditions, the middle graph for the target-defined visual aids, and the bottom graph for the target-undefined aids. Data from stereo- and monoscopic stages are combined.)
In every condition, subjects were shown to have pointed at least somewhat in front of the targets, with the error being substantially higher when unaided. The higher error in the sphere cutout condition suggests that the type of additional accuracy information afforded by the aid may not actually help. One interesting distinction in design between target rods and sphere cutout (the two target defined cases) is that they show accuracy in a discrete fashion (rod color changing upon meeting a low error threshold) versus a continuous fashion (the shape of the cutout changing continuously with error). The higher error in the sphere condition may suggest that discrete changes in feedback can provide better results.
As in the preliminary study, in addition to being less accurate, the baseline and introduction conditions showed considerably larger variances. This can be seen in Fig. 9, which shows the distribution of errors for each condition. A high variance indicates low precision regardless of pointing estimation method. We note a wide variance in baseline and introduction conditions compared to the relatively narrow variance of the visual aids.
Notably, the variation for the introduction condition is skewed considerably toward positive error, while the baseline is centered closer to 0 (accurate). Because subjects had likely seen and used a visual aid before performing the baseline condition, we infer that subjects learned something about the necessary motion by using the visual aids. This is in line with subject questionnaire answers that suggested the aids helped develop muscle memory and were "like training wheels".

Effect of stereoscopy
Stereoscopic depth cues are often considered to be important for pointing in a 3D environment (Stuerzlinger and Teather 2014). However, recent work has shown the potential for the natural stereo discrepancies in devices with vergence-accommodation conflicts to negatively affect pointing estimations, especially at larger egocentric distances (Barrera Machuca and Stuerzlinger 2019; Lin and Woldegiorgis 2017). While our preliminary study showed a limited improvement to unaided pointing when introducing stereoscopy, it did not explore the differences when used in tandem with the cues.
Results (given above) showed a main effect for stereo mode on both azimuth and elevation. Mean azimuth error values show the monoscopic view (5.22°) produced higher error than the stereoscopic view (4.63°). Mean elevation error shows the monoscopic view (.76°) differed marginally from the stereoscopic view (1.08°), though it is again difficult to say if one is worse. To better understand the effect, we performed paired-sample t-tests within each condition between stereo and mono modes. We were most interested in differences within the introduction and baseline conditions, which would reflect the effect of stereoscopy before and after training. For azimuth error, neither the introduction condition ( p = 0.111 ) nor the baseline condition ( p = 0.395 ) showed a statistically significant difference between mono and stereo. However, visual inspection of Fig. 8 suggests a possible (but not statistically demonstrated) decrease in this difference from the introduction condition to the baseline, suggesting that subjects may have initially benefited from stereoscopic cues in the introduction phase, but that the benefit diminished after subjects adjusted to the task. For elevation error, target rods were the only condition with a significant difference ( p = 0.032 ) between stereo and mono. Interestingly, the inclusion of stereo cues actually slightly worsened elevation accuracy for this condition.

Time
Time-to-select can be seen in Fig. 10. A two-way ANOVA revealed a significant main effect for condition [ F(3.27, 104.64) = 19.947, p < 0.001, GG = 0.65 ], but no main effect for stereo mode. Post-hoc tests showed that the introduction and baseline conditions were faster than all visual aids ( p < 0.001 for all), and baseline was faster than the introduction ( p = 0.009 ), suggesting that subjects got faster at unaided pointing after practice. There were no differences between visual aids ( p = 1.0 for all).
In comments collected after each round, several participants reported that they felt they were able to click most quickly when they were unaided, but noted that they were not confident in their accuracy. Having a higher time-to-select for aided pointing is expected and in line with preliminary study results, as the subjects were made more aware of their inaccuracy, and thus would spend more cognitive effort (increasing time) judging whether they were accurate before clicking.
However, we suspected that the subjects had actually reached the targets quickly, and spent much of the time refining or confirming position. This is supported by results in the following section. Therefore, we consider this data to be evidence not of the time it took to accurately reach a target, but rather of the time subjects took to decide they were done refining their accuracy. In this way, all visual aids were faster than unaided pointing, but when compared to each other were not shown to be significantly faster.

Motion over time
We sampled the reported pointing error at every Kinect update step. Figure 11 shows this data plotted for 3 s of motion for azimuth error. We show 3 s of data because pointing stabilizes for all conditions near 2 s, and 3 s is near the average time-to-select for the baseline. We consider pointing to be broken into two primary types of motion (Meyer et al. 1988): ballistic (large initial movements made to get near the target) and corrective (smaller controlled movements made to hone in on accuracy). We look at the motion primarily in these terms.
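The motion plots average unevenly timed samples into fixed 50 ms bins (per the Fig. 11 caption) to cope with the Kinect's unsteady update rate; a minimal sketch of that binning, with our own data layout:

```python
def bin_average(samples, bin_ms=50):
    """Average (time_ms, error_deg) samples into fixed-width time bins,
    coping with uneven sensor update rates. Returns {bin_start_ms: mean}."""
    bins = {}
    for t, err in samples:
        key = int(t // bin_ms) * bin_ms   # start time of the bin
        bins.setdefault(key, []).append(err)
    return {k: sum(v) / len(v) for k, v in sorted(bins.items())}

# Samples arriving at irregular ~30 Hz intervals (made-up values):
trace = [(0, 20.0), (33, 18.0), (66, 15.0), (101, 12.0), (135, 11.0)]
binned = bin_average(trace)   # -> {0: 19.0, 50: 15.0, 100: 11.5}
```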
For azimuth error, the introduction deviated most clearly from all other conditions. Ballistic motion ended with subjects approximately 20° in front of the target. Corrective motion did not substantially reduce this error, and subjects made no further ballistic motions. However, the motion for unaided pointing changed after the subjects had been exposed to the visual aids. The baseline condition followed a very similar ballistic path to aided pointing. However, during the corrective phase, subjects pulled farther in front of the target.
This difference in ballistic motion between the introduction and baseline implies that once subjects understood the task through the use of the visual aids, they were able to replicate that ballistic motion. This implies that the visual aid may not need to be always visible, and could be displayed only when some pointing motion is detected, reducing on-screen items when unnecessary.

Region analysis
We consider that one major difference between pointing with a first-person perspective and pointing with a mirrored third-person perspective is that visible targets can be placed in front of, behind, or to the side of the user. In contrast, a first-person view emphasizes targets in front of the user, and the user will not need to point behind themselves (targets may exist behind the person, but users will likely be able to rotate to face them). We consider that different visual aids may yield different pointing motions in different regions. Figure 12 shows motion over time for the targets in front of and behind the subject's avatar.
(Fig. 11 caption: Average motion over time, shown in terms of mean azimuth error (degrees). To cope with subjects having data points reported at different times (due to unsteady update rates), data points are averaged in 'bins' of 50 ms.)
We compare the trajectories from Fig. 12 to those shown in Fig. 11. While the average error for the sphere cutout visual aid appeared to be low, it is seen to have been much less accurate in the front and back regions. We expect that the inaccuracy in the front region is due to the difficulty of seeing the sphere cutout's shading behind the target, which is larger than the targets in the back region due to perspective foreshortening. No other visual aids substantially stand out from one another.
The baseline condition was comparatively accurate in the front region, with subjects on average reaching the target before any of the other conditions, while in the back region it again faced the problem of subjects pulling away from the target during corrective motion. Similarly, the introduction condition was comparable to other conditions in the front region, but was substantially less accurate in the back region. We expect this to mean that subjects were less certain of their accuracy for targets behind them, and did not initially correctly perceive the amount of motion it would take to reach the target.
Starting error was higher for the front region due to the hand's resting position between trials. The general ballistic motion for the back region was shorter (approximately 1 s) compared to the front (approximately 1.5 s). After the introduction condition, subjects would quickly recognize that the target was behind them, move quickly to that general region, and then spend more of their pointing time doing corrective motions.

Subjective ratings and comments
Subjective ratings can be seen in Fig. 13. A Friedman test revealed differences between the five conditions ( χ²(4) = 41, p < 0.001 ). As expected, when compared to the baseline, the sphere cutout, target rods, pointing wand, and pointing fan were all shown to have statistically significantly higher ratings ( p < 0.001 for all). However, there were no statistically significant differences between the visual aids. Despite the target rods' wide variation in rating, the highest number of subjects (9) named it as their preferred visual aid in the exit questionnaire.
Other interesting comments regarded the effect of stereoscopy on subjects' ability to complete the test. 13 subjects indicated that they believed the stereoscopic effect helped, though results do not support this. One subject noted that they felt the 3D effect increased their confidence for targets at a similar depth to their avatar. Others suggested that the stereoscopic effect felt more natural. This suggests that virtual mirror applications may not need the stereoscopic effect, but users may appreciate it and have more confidence in their pointing if it is included.

Discussion and conclusion
We presented two studies on the design and testing of visual aids for pointing in a third-person perspective in a virtual mirror. A preliminary study aimed to tune a large number of aids to desirable parameters and narrow down which might be best for a larger study. The main study more deeply analyzed the best performing of those techniques and collected more information about how users point in general given the mirrored third-person perspective. We discuss findings about generalized unaided pointing and stereoscopy, then discuss differences between techniques.

Unaided pointing
Analysis of accuracy and variance shows that unaided pointing is both inaccurate and imprecise, with unaided pointing after training showing a wide variance and an average of 12 degrees of error from the target in the azimuth axis. This imprecision would be difficult to correct for in a UI item-selection scenario and would confuse other users in a communication scenario.
Looking at the unaided motion over time in general and in different regions revealed more interesting information. Looking at motion for all cases, while the introduction phase showed a distinct difference from the aided pointing, the trained unaided baseline actually closely followed the ballistic motion of aided pointing, only breaking off to pull in front of the target in the corrective phase of motion (after around two seconds). This trend held for the back region as well, with the untrained unaided pointing being less accurate than other cases and the trained unaided pointing following aided pointing before pulling away. However, in the front region, trained unaided pointing followed aided pointing for its entire length, and actually hit an accurate point before any other condition.
(Fig. 13 caption: Rating scores given to each visual aid and the unaided baseline by subjects. Horizontal lines denote medians, pluses denote outliers, and triangles denote means.)
We draw several conclusions from this. We suggest the difference in pointing behind and in front of the avatar may be due to the extra effort it takes to reach back, causing users to settle into a more natural stance pointing in front of the target. We infer that after some training, users may not need a visual aid to correctly point at targets in front of their avatar. Since subjects reported not feeling confident about their accuracy without feedback, some minimal feedback may be desired (e.g., highlighting a menu UI button or showing a wand when near a target), but feedback may be unnecessary if designers believe a visual aid would be too obtrusive. Since users tend to follow the correct ballistic motion for targets behind their avatar, a system with known targets may only need to give feedback once pointing is detected. In a system for communication, however, developers may want to show visual aids at all times, since users still show a high variance and lack of confidence in pointing.

Stereoscopy
In the preliminary study, we tested unaided stereoscopic 3D as a baseline, based on the idea that stereoscopy can aid pointing in 3D (Stuerzlinger and Teather 2014). However, results suggested that this may not be the case for this third-person view. To further test this, we tested each visual condition in both mono- and stereoscopic views in the main study.
Results showed minimal differences between mono-and stereoscopic conditions. For azimuth error, no conditions showed a significant difference between the two views, with only visual inspection suggesting possibly less error in the stereoscopic condition. Any difference there may have been, however, disappeared in the baseline unaided condition, when the subject had experience with this style of pointing. Despite this, subjects did report feeling more comfortable with their pointing in the 3D view. We infer from this that a designer would not need to include stereo 3D in a virtual mirror application to reinforce pointing, but users may appreciate its inclusion.

Visual aid comparisons
Our main study focused on the pointing wand, pointing fan, target rods, and sphere cutout visual aids. We originally suspected that visual aids that worked with defined targets (target rods and sphere cutout) would allow users to point more quickly and accurately, as feedback could be based on pointing error. However, this was not shown to be the case; there was little evidence to support substantial differences between aids in these groups, nor were they substantially more preferred. The sphere cutout technique, in fact, was shown to provide worse accuracy than the other aids, especially in the front region where the cutout shading was not as visible. The target rods' superiority in reducing azimuth error may suggest that, if targets can be defined, it is best to give the user discrete feedback about their pointing (e.g., turning green when the user is close) as opposed to continuous feedback. However, the target rods' limits in reducing elevation error suggest that if feedback is given at all, it is important to include feedback on elevation error and not just azimuth.
Given the similarities in performance between the different visual aids, it is difficult to suggest a single one (although they can be recommended over others excluded based on the preliminary study). A suggested visual aid should ideally be accurate, precise, quick to use, and stable once reaching the target, and users should be comfortable with it. Considering these priorities, and the more restrictive use case for target-defined visual aids, we suggest using the pointing fan when it is acceptable to visually obstruct large parts of the scene, and the pointing wand when it is not. While users stated that they could more easily tell the depth at which they were pointing with the fan, its obstructive appearance made the experience less immersive. The wand obstructed less of the environment, and the scaling allowed users to see if they were hitting a target, but did not provide a good sense of depth until the target was found. Additionally, considering that users tend to produce a fairly accurate ballistic motion toward the target, visual cue imagery could be faded in when the pointing direction comes within an angular distance threshold of any defined target and the system detects the user is ready to perform corrective motions.
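A sketch of the fade-in idea above, with hypothetical threshold values (the paper does not specify any):

```python
def aid_alpha(angular_dist_deg, fade_start=30.0, fade_end=10.0):
    """Opacity for a pointing aid that fades in as the pointing
    direction approaches a target: fully hidden beyond fade_start
    degrees, fully visible inside fade_end degrees, and linearly
    interpolated in between. Thresholds are illustrative assumptions."""
    if angular_dist_deg >= fade_start:
        return 0.0
    if angular_dist_deg <= fade_end:
        return 1.0
    return (fade_start - angular_dist_deg) / (fade_start - fade_end)

# During ballistic motion (far from the target) the aid stays hidden;
# it fades in as corrective motion begins near the target.
alpha_far, alpha_mid, alpha_near = aid_alpha(45.0), aid_alpha(20.0), aid_alpha(5.0)
```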

Limitations and future work
While we attempted to account for limited accuracy of the Kinect tracking system, having to define the pointing direction as from the elbow to wrist may make pointing feel unnatural for some subjects. Comfort and recording accuracy may be increased by advanced sensors that can more accurately detect the user's intended pointing through, e.g., wrist to finger direction.
Gender in our sample was imbalanced, with a 10:1 male to female ratio. While we did not see a difference between genders in an initial survey of our data, we cannot rule out the possibility of men and women performing different pointing motions on average. Future work may aim for a more even balance of gender to ensure results are generalizable.
Future work may also further distinguish between visual aids by presenting them among different tasks (e.g., object