1 Introduction

In immersive virtual environments (IVE), mid-air manipulation has been a popular research topic. Its goal is to provide a natural way of handling 3D objects in IVE. There are many factors that affect the efficiency and convenience of mid-air manipulation. They include degrees of freedom [20], virtual object shapes [15] and sizes [30]. With respect to the object sizes, it has been discovered that the performance of manipulation tasks decreases significantly if virtual objects are too small or too large [8, 35, 38].

One of the techniques widely used to overcome the size problem is zoom [2, 11, 34], but the artifacts it causes are also well recognized. They include a mismatch between the physical and scaled virtual worlds [28], misperception of scale [9] and loss of perspective [40].

On the other hand, virtual objects can be manipulated with proxy-based methods [25, 26], where the virtual objects are provided as proxies of “convenient working sizes.” Unfortunately, the sizes were selected heuristically in the previous works. To the best of the authors’ knowledge, no study has investigated what the convenient working sizes are.

This paper reports the results of our efforts to investigate the object sizes that are optimal and convenient for mid-air manipulation. Using a well-known mid-air manipulation method, the optimal size of a reasonably complex object (Utah teapot) was estimated through an experiment. Through another experiment with zoom operation, a range of convenient working sizes was also estimated. Finally, the optimal and convenient working sizes were verified via an additional experiment based on a new proxy-based method. We believe our studies give a guideline for advanced studies on the object size problems in IVE.

2 Related work

Various mid-air manipulation techniques have been proposed in order to manipulate virtual objects with accuracy and speed [18]. The most intuitive ones are the 6DOF techniques [3, 12, 16, 36], since they mimic the way humans manipulate a single object in the real world. The PRISM technique [6, 7] increases the accuracy of 6DOF manipulation by scaling down the translation and rotation of objects. In addition, symmetric-synchronous techniques such as Handle-bar metaphor [32] and Spindle \(\times \) Wheel [5] allow manipulating multiple objects by grabbing the objects with both hands. For example, the global translation/rotation of the objects is enabled by simultaneously moving both hands in the same/different direction.

On the other hand, some researchers focused on the comparisons between the proposed techniques as well as their evaluations. For example, Rodrigues et al. [31] compared and evaluated the 6DOF techniques and the Handle-bar metaphor, whereas Mendes et al. [19] compared different manipulation methods using a stereoscopic tabletop-based approach and mid-air bi-manual methods. They also evaluated the benefits of adopting DOF separation in mid-air manipulation against direct manipulation [20].

There have been several papers that address the manipulation of distant objects, as the visual size of the objects may vary in the perspective view. For example, Whitlock et al. [37] evaluated the effects of different modalities. The Go-Go technique [29] manipulates distant objects by introducing a nonlinear mapping between the physical hand and the virtual hand in order to extend the user’s reach in the virtual environment. The hybrid interaction technique named as HOMER [3] uses ray-casting for ease of selecting and manipulating distant objects. Mine et al. [22] introduced the concept of scaled world grab, which scales up the world to deal with the relatively resized object.

As reported in [30], the sizes of virtual objects have significant effects on mid-air manipulation. For example, the Scaled HOMER study [38] revealed significant interactions between techniques and target sizes. LaViola et al. [14] found that it is difficult to select and manipulate virtual objects of different sizes not only from multiple angles but also at different distances from the user’s virtual avatar, which requires significant time and effort for navigation. Object selection in IVE has also been studied, with the elapsed time measured under varying conditions of occlusion, size and density [35]. In that study, the selection time for small objects was usually double that for large objects. The study of PinNPivot [8], which constrains the rotation of virtual objects, found that the positional error of large objects is greater than that of small objects.

Zoom techniques, which resize a user [1], have also been widely studied. GulliVR [13], for example, adjusts depth perception while scaling the user by resizing the IPD to match the player’s size in order to prevent cybersickness. A technique that allows the user to rotate, scale and translate the virtual scene about the user reduces the need for physical locomotion [21]. EiHCam [17] uses an extra camera in immersive environments, which can be scaled by a pinch gesture. Wang et al. [34] provided zoom using a pinch gesture in complex level-editing tasks. MakeVR [11] enabled users to translate, zoom and rotate in real space rather than on a flat display. A zooming interface and a room-sized visualization have been compared with and without an overview [40].

Proxy-based methods for virtual environments have been widely researched to provide interaction techniques such as object selection, navigation, path planning, and visualization. Worlds in Miniature [33] demonstrated a remote manipulation technique that provides small copies of the world held in the hands. Voodoo Dolls [26] creates an instant copy of the object, resizes the copy into a comfortable working size, and attaches the object to the user’s hand. The image plane interaction technique [25] deals with distant 3D objects by interacting with 2D projections of the objects on the image plane. GARDEN [24] was designed to increase the accuracy of manipulating distant objects in a shared environment by using sphere casting. Remote collaboration systems [4, 23] similarly give the remote user the ability to create and manipulate 3D objects to assist communication. Recently, Poros [27], a system that allows users to rearrange distant space after marking a portion of the space, has been proposed. Magnoramas [41] presents a solution for simultaneously viewing the virtual visualization of the real world and a user-controlled copy of a region of interest.

3 Mid-air manipulation

Our study focuses on manipulating 3D rigid objects through translations and rotations in IVE. For experiments, we implemented a well-known manipulation technique named 6DOF method [36].

Fig. 1 Utah teapot manipulated with 6DOF method

3.1 6DOF method

As its name suggests, the 6DOF method supports 6DOF motions, which comprise 3DOF translations and 3DOF rotations. It mimics interactions with physical objects, i.e., it consists of grabbing an object, moving it to a new location, and releasing it. Dragging an object changes the object’s position, and rotating the wrist changes its orientation. Throughout the entire manipulation, both translation and rotation are applied to the object at the same time. See Fig. 1. When the user touches an object, its wireframe bounding box appears in green, and the virtual hand touching the object also turns green. Both turn red when the user grabs the object. After being grabbed, the object directly follows the movement of the hand. Henceforth, the 6DOF method is denoted simply as 6DOF.
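The statement that a grabbed object “follows the movement of the hand” can be illustrated with a minimal sketch. It assumes poses are given as a position triple and a unit quaternion in (w, x, y, z) order; the function names `grab` and `follow` are hypothetical and are not taken from the implementation used in the experiments.

```python
def q_mul(a, b):
    """Hamilton product of two quaternions given as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)

def q_conj(q):
    """Conjugate, which equals the inverse for a unit quaternion."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def q_rotate(q, v):
    """Rotate the 3D vector v by the unit quaternion q."""
    _, x, y, z = q_mul(q_mul(q, (0.0, *v)), q_conj(q))
    return (x, y, z)

def grab(hand_pos, hand_rot, obj_pos, obj_rot):
    """At grab time, store the object's pose in the hand's local frame."""
    inv = q_conj(hand_rot)
    offset = tuple(o - h for o, h in zip(obj_pos, hand_pos))
    return q_rotate(inv, offset), q_mul(inv, obj_rot)

def follow(hand_pos, hand_rot, local_pos, local_rot):
    """While grabbed, re-express the stored local pose in world space."""
    offset = q_rotate(hand_rot, local_pos)
    pos = tuple(h + o for h, o in zip(hand_pos, offset))
    return pos, q_mul(hand_rot, local_rot)
```

Because the stored local pose is re-applied with every new hand pose, translation and rotation reach the object simultaneously, which matches the behavior described above.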

3.2 Supportive operations

In our experiments, two operations are supported for mid-air manipulation.

3.2.1 Grab-navigation

In our experiments, subjects are often asked to manipulate a huge object. Unfortunately, it may be larger than a real-world room, making it impossible for the subjects to look around the object or look down on it from high in the air.

In order to tackle the problems, a supportive operation is provided, which we call grab-navigation. By grabbing an empty space with a single hand and pulling it along a direction, we can move along the opposite direction. This enables us to navigate the virtual space in any direction we want [10, 39].
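As a minimal sketch of this operation, assuming positions are 3D tuples in metres and that the viewpoint is displaced by exactly the negated hand displacement (the function name `grab_navigate` is hypothetical):

```python
def grab_navigate(viewpoint, hand_start, hand_now):
    """Move the viewpoint opposite to the direction in which the hand pulls."""
    return tuple(v - (n - s) for v, s, n in zip(viewpoint, hand_start, hand_now))

# Pulling the hand 0.5 m toward the body (negative z) moves the user 0.5 m forward.
new_viewpoint = grab_navigate((0.0, 0.0, 0.0), (0.0, 1.0, 1.0), (0.0, 1.0, 0.5))
```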

3.2.2 Zoom

This is used to scale up or down the whole virtual environment [40]. When the user grabs an empty space with both hands and squeezes it, the scene is scaled down so that the user would feel like a giant. In contrast, by stretching the empty space, the scene is scaled up.
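One plausible way to realize this gesture, shown here only as a sketch, is to scale the scene by the ratio of the current to the initial distance between the two hands. This mapping and the function name `zoom_factor` are assumptions; the exact mapping used in the experiments is not specified in the text.

```python
import math

def zoom_factor(left_start, right_start, left_now, right_now):
    """Scene scale factor from a two-handed squeeze (< 1) or stretch (> 1)."""
    d_start = math.dist(left_start, right_start)
    d_now = math.dist(left_now, right_now)
    return d_now / d_start

# Squeezing the hands from 0.4 m apart to 0.2 m apart halves the scene scale.
factor = zoom_factor((-0.2, 0.0, 0.0), (0.2, 0.0, 0.0),
                     (-0.1, 0.0, 0.0), (0.1, 0.0, 0.0))
```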

4 Experiment setup

We conducted three experiments: (1) pilot study, (2) main experiment, (3) additional experiment for verification. The pilot study was conducted with a small number of participants for the purpose of designing the main experiment. The main experiment was conducted with a larger number of participants. Its goal was to discover the object sizes that are optimal and convenient for mid-air manipulation. Then, using a new proxy-based manipulation method, an additional experiment was made to verify the discovered optimal and convenient sizes.

Oculus Quest 2 was used for all experiments. Its wireless HMD and controllers communicated with a desktop PC over a 5G Wi-Fi network. Through the PC screen, the experiment supervisor shared the view displayed on the participant’s HMD. On the controllers, only the trigger and grip buttons were used. All experiments were pre-approved by the authors’ institutional review board.

Fig. 2 Virtual workspace used for all experiments

Fig. 3 Docking task between green source and silver target

4.1 Virtual workspace

Fig. 4 Eight object scales from 0.01 to 10 m

The virtual workspace used for all experiments is shown in Fig. 2. It is a \(30\,\mathrm{m}\times 30\,\mathrm{m}\times 30\,\mathrm{m}\) cubical space, at the center of which a translucent blue square (\(2\,\mathrm{m}\times 2\,\mathrm{m}\)) mat is located. The subjects stand on the mat, which corresponds to a ‘clear’ real-world floor. As the subjects are asked not to step off the mat, it helps them avoid collisions with real-world obstacles.

In order to provide the visual effect of grab-navigation and zoom operations presented in Sect. 3.2, six faces of the cube are all textured with grid patterns, where parallel lines are placed at intervals of 1 m.

4.2 Docking tasks

In all experiments, the subjects are asked to perform docking tasks between objects of the same size. As shown in Fig. 3, we use Utah teapots. The source teapot (in green) should be moved to fit the target teapot (in silver), which is fixed and immovable in space. The Utah teapot was selected for its distinct geometric features, such as the spout, handle and lid. As discussed by Martin-Gomez et al. [15], such distinct features provide useful cues for orientation.

In the docking tasks, different scales of the teapot were used. See Fig. 4. In terms of the length of the line connecting the spout and handle, the sizes of 0.01 m, 0.05 m, 0.1 m, 0.5 m, 1 m, 3 m, 5 m and 10 m (in increasing order) were used in the pilot study. The minimum, 0.01 m, and the maximum, 10 m, were heuristically determined. While designing the pilot study, we figured out that objects that are smaller than 0.01 m or larger than 10 m are very difficult or almost impossible to manipulate.

Fig. 5 Poses of source and target objects

4.3 Object placements

In the docking tasks, the target objects are placed in front of the subject “one at a time” with different poses. Figure 5a shows an imaginary cone, the apex of which is in the middle of the shoulder. The apex angle is 120\(^{\circ }\). The boundary of the cone’s base is regularly sampled at 45\(^{\circ }\) intervals to make eight sample points.

A target is placed per sample point. Figure 5b shows the steps to place a target. Initially, the target’s center is located at a sample point. It is rotated through Euler transform. Each of the Euler angles, \(\theta _x\), \(\theta _y\) and \(\theta _z\), is either \(+45^{\circ }\) or \(-45^{\circ }\) so that the targets have eight distinct orientations in total. Shown in the middle of Fig. 5b is one of them. Then, the target is translated horizontally such that it contacts but does not penetrate the cone’s base. Each subject’s shoulder height and arm length are measured before the experiments, and all targets are placed within arm’s reach regardless of their sizes.
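The sampling scheme can be sketched as follows. The cone axis direction (+z), the cone height, and the function names are assumptions made for illustration; only the 120° apex angle, the 45° sampling interval, and the ±45° Euler angles come from the description above.

```python
import itertools
import math

def cone_base_samples(apex, height, apex_angle_deg=120.0, step_deg=45.0):
    """Sample the boundary of the cone's base circle at regular angular steps.

    The cone is assumed to open along +z from `apex`; `height` (apex to base)
    is a hypothetical parameter chosen so that targets stay within arm's reach.
    """
    radius = height * math.tan(math.radians(apex_angle_deg / 2.0))
    ax, ay, az = apex
    count = round(360.0 / step_deg)
    return [(ax + radius * math.cos(math.radians(k * step_deg)),
             ay + radius * math.sin(math.radians(k * step_deg)),
             az + height)
            for k in range(count)]

def euler_orientations(angle_deg=45.0):
    """All sign combinations of (theta_x, theta_y, theta_z): 2^3 = 8 orientations."""
    return list(itertools.product((+angle_deg, -angle_deg), repeat=3))

sample_points = cone_base_samples(apex=(0.0, 1.4, 0.0), height=0.3)
orientations = euler_orientations()
```

Both lists contain eight entries, so each sample point can be paired with one orientation.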

The source object is in front of the subject at the shoulder’s height. See the green teapot in Fig. 5c. Whereas the source’s size changes over docking tasks, so that it is always the same as the target’s, its orientation is fixed. The source’s position is adjusted to make its surface within arm’s reach. Figure 5d, e shows additional source–target examples.

Fig. 6 The locations of the source and target objects are highlighted before the docking task is started

The source and target appear in front of the subjects with the ‘initial’ size of 0.2 m, as shown in Fig. 6a, and then they are scaled up or down to the pre-specified scale, e.g., 0.01 m shown in Fig. 6b. (The initial size, 0.2 m, is determined heuristically.) The scaling animation takes 1 s. After scaling is completed, two red circles pop up and disappear so that the subjects do not miss the scaled objects. See Fig. 6c.

5 Pilot study

5.1 Participants and procedure

The pilot study was conducted with five participants, including three authors of this paper. Each subject performed 16 docking tasks, which were grouped into two experimental blocks. In a block, the eight targets were distinct in every aspect of size, position and orientation, as shown in Fig. 5c through Fig. 5e. To mitigate possible bias, the order in which the targets appeared was given by Latin squares. With a balanced Latin square, the ordering is counterbalanced and the order effect is reduced [42].
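For reference, a balanced Latin square for an even number of conditions can be generated with the standard zig-zag construction sketched below. How the squares used in the study were actually built is not stated, so this is an illustration only; `balanced_latin_square` is a hypothetical helper.

```python
def balanced_latin_square(n):
    """Return n orderings of conditions 0..n-1 (n must be even).

    Each condition appears once per row and once per column, and every
    ordered pair of distinct conditions is adjacent in exactly one row.
    """
    if n % 2 != 0:
        raise ValueError("this construction requires an even n")
    first = [0]
    low, high = 1, n - 1
    while len(first) < n:
        first.append(low)
        low += 1
        if len(first) < n:
            first.append(high)
            high -= 1
    # Every further row shifts all conditions of the first row by one.
    return [[(c + r) % n for c in first] for r in range(n)]

# Eight rows, one per presentation order of the eight targets.
orders = balanced_latin_square(8)
```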

Fig. 7 Finish button. a Until touching the button, it remains black. b If the subject’s hand stays touching the button for a second, the green progress bar starts to grow. It is 100% filled after 1 s. Then, the next docking task is presented to the subject

As can be seen in Fig. 6, there is a button beside the subject. If a subject is satisfied with the result of a docking task, the subject can proceed to the next task by touching the button. We call it the ‘finish’ button. To prevent a task from being finished by an accidental touch, subjects were asked to keep touching the button for 1 s. See Fig. 7.

Fig. 8 Completion times in pilot study

5.2 Results and analysis

In each docking task, we measured the completion time consumed until the finish button was touched. The statistics for eight different scales are depicted in Fig. 8. At the stage of designing the pilot study, longer completion times were expected for small and large objects, and shorter times for medium-size ones. However, Fig. 8 does not exactly match the expectation. During the pilot study, we observed that many subjects spent unnecessarily long trying to achieve higher precision even after the source almost fit the target. This suggested the need for establishing a certain error threshold such that the docking task can be judged to be completed once the error between source and target falls below the threshold.

Fig. 9 Normalized translation error and rotation error in pilot study

During the experiment, we measured translation and rotation errors. The translation error is defined as the Euclidean distance between the centers of source and target. It is normalized by the object size so that, for example, a 1 m error between 10 m-long objects is identical to a 0.5 m error between 5 m-long objects, i.e., both errors equal 0.1. This normalized translation error is denoted as NTE. On the other hand, the rotation error, denoted as RE, is defined as the angle between the two unit quaternions that represent the orientations of source and target. The measured errors are depicted in Fig. 9.

Through the regression analysis, the ratio between NTE and RE was computed as 0.008:1. We defined the total error as \({\textbf {NTE}}+0.008\times {\textbf {RE}}\). Then, we heuristically set the threshold of the total error to 0.1.
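The error definitions above can be written down as a short sketch. It assumes that RE is expressed in degrees (the text does not state the unit) and that quaternions are unit-length (w, x, y, z) tuples; all function names are hypothetical.

```python
import math

RE_WEIGHT = 0.008       # NTE:RE ratio reported above
ERROR_THRESHOLD = 0.1   # heuristic threshold on the total error

def normalized_translation_error(source_center, target_center, object_length):
    """Euclidean distance between the centers, divided by the object size."""
    return math.dist(source_center, target_center) / object_length

def rotation_error_deg(q_source, q_target):
    """Angle between two unit quaternions, in degrees (unit is an assumption)."""
    dot = abs(sum(a * b for a, b in zip(q_source, q_target)))
    return math.degrees(2.0 * math.acos(min(1.0, dot)))

def total_error(nte, re_deg):
    """Total error NTE + 0.008 x RE."""
    return nte + RE_WEIGHT * re_deg

def is_docked(nte, re_deg):
    """True once the total error falls below the threshold."""
    return total_error(nte, re_deg) < ERROR_THRESHOLD
```

Under the degree assumption, a perfectly positioned object would be judged docked with a rotation error of up to 12.5°, since \(0.008 \times 12.5 = 0.1\).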

During the docking tasks, the subjects keep repeating grab and release actions presented in Sect. 3.1. We measured the pose of the source whenever it was released. Using the logged data, the completion time is recomputed, i.e., as soon as the pose error falls below the threshold, the docking task is judged to be completed and the elapsed time is recorded. Figure 10 shows the recomputed completion times. The results are largely compatible with our expectation, showing that the threshold is reasonably set.
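The recomputation amounts to scanning the release log for the first pose whose total error is below the threshold. The sketch below assumes the log is a time-ordered list of (elapsed seconds, total error) pairs; this data layout and the function name are hypothetical.

```python
def recomputed_completion_time(release_log, threshold=0.1):
    """Return the elapsed time of the first release whose error is below the threshold.

    `release_log` is assumed to be sorted by time. None is returned when the
    error never falls below the threshold.
    """
    for elapsed, error in release_log:
        if error < threshold:
            return elapsed
    return None

# Hypothetical log: the task counts as completed at the third release.
log = [(3.2, 0.40), (7.9, 0.15), (11.5, 0.08), (14.0, 0.02)]
completion = recomputed_completion_time(log)
```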

Additional observations can be made in Fig. 10. First of all, most docking tasks were completed within 60 s. This led us to introduce a time limit for each task in the main experiment, i.e., if a subject fails to complete a docking task within 60 s, the task is terminated, and the completion time is set to 60 s. In the pilot study, there was no time limit, and we observed that subjects often got exhausted at the end of the experiment and tended to finish the tasks even though the objects were not sufficiently fit. This also supported the introduction of the time limit.

Secondly, the statistics differ significantly between 0.01 and 0.05 m in Fig. 10. The same can be found in Fig. 9a (for translation errors) and Fig. 9b (for rotation errors). This led us to sample the interval between 0.01 and 0.05 m. We added three more scales, 0.02 m, 0.03 m and 0.04 m, in the main experiment.

In summary, the pilot study contributed to the following features of the main experiment: (1) error threshold, (2) time limit (60 s), (3) additional scales of 0.02 m, 0.03 m and 0.04 m.

Fig. 10 Completion times recomputed using a threshold

6 Main experiment

In the main experiment, the participants performed docking tasks not only with 6DOF but also with 6DOF (Z), which is 6DOF augmented with the zoom operation presented in Sect. 3.2.2. With 6DOF, the optimal object size is identified, which minimizes the completion time. It is the size for efficiency. In contrast, with 6DOF (Z), the convenient working sizes (henceforth, simply convenient sizes) are identified, which are selected by the participants. All subjects performed 6DOF ahead of 6DOF (Z) so that the completion times measured with 6DOF are free from learning effect.

6.1 Participants and procedure

In the experiment, thirty-three subjects participated: seven were female and twenty-six were male. Their ages ranged from 18 to 35 (\(M = 23.12\), SD \(=\) 3.15). All subjects had normal or corrected-to-normal vision. Only two subjects used the left hand as the dominant hand. Ten subjects had no experience of playing 3D games. Ten subjects had no experience of wearing VR devices such as HMDs, and sixteen subjects had not used gesture-based devices or controllers in IVEs.

During the time of this study, COVID-19 spread rampantly. Upon arrival at the building, all participants took their temperatures at the lobby. They were allowed to enter the building only if they did not have a fever. HMD and controllers were disinfected with a sanitizer before and after the experiment.

Table 1 Experiment procedure for each method

Before the experiment began, the interpupillary distance (IPD) of each subject was measured for HMD setup. Then, each subject tested 6DOF. The experimental procedure for 6DOF is presented in Table 1 with the elapsed time on each step. After watching a tutorial video, the subject had a 10-min practice using three objects shown in Fig. 11. (Neither the tutorial nor the practice was for docking tasks. They were about general 6DOF manipulation.) Then, the rubric for the docking tasks was presented, as shown in Fig. 12. Finally, the subject started the docking tasks by touching the button. It worked in the same way as the ‘finish’ button used for the pilot study (Fig. 7), but it was used for ‘start’ in the main experiment.

Fig. 11 The objects used for practice in the main experiment

As presented in Sect. 5.2, three more scales (0.02 m, 0.03 m and 0.04 m) were added to the eight scales used for the pilot study. For each method, a subject performed 22 docking tasks, which were grouped into two experimental blocks. The orders of size, position and orientation of the targets were given by Latin squares.

Each docking task was finished either when the source fit the target within the tolerance or after 60 s had passed. The 22 docking tasks could take up to 22 min, as each task was allotted at most 1 min. However, all 22 docking tasks together took 10.44 min on average.

After taking a 10-min break, the procedure presented in Table 1 was resumed with the next method, 6DOF (Z), where the 22 docking tasks took 3.91 min on average. The whole experiment with two methods took about an hour, and a subject was offered 30 USD for participation.

Table 2 Results of post hoc analysis between the adjacent object sizes in terms of completion times

Fig. 12 The rubric presented in the main experiment

Fig. 13 Completion times of 6DOF in the main experiment

6.2 Results and analysis

6.2.1 Optimal size

For 6DOF, the completion times consumed by a subject for finishing 22 docking tasks were measured. Figure 13a shows the completion times for eleven scales in 6DOF.

In order to remove outliers, only the data between the first quartile (Q1) and the third quartile (Q3) are considered in each scale, i.e., Q1 and Q3 are the lower and upper bounds of the completion times, respectively.

We need to fit the data between Q1 and Q3 with a function so that its argmin (argument of the minimum) is taken as the optimal size. Before fitting, the data are post-processed. For example, many values for 0.01 m and 0.02 m in 6DOF would have been greater than 60 s without the time limit. Such data are unreliable. We discard the data of any scale whose Q3 reached 60 s.
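The two post-processing steps can be sketched with the Python standard library. The quartile convention (the “inclusive” method of `statistics.quantiles`) is an assumption, as the text does not say which definition was used; the function names are hypothetical.

```python
import statistics

TIME_LIMIT_S = 60.0

def interquartile_subset(times):
    """Keep only the completion times between Q1 and Q3 (inclusive)."""
    q1, _median, q3 = statistics.quantiles(times, n=4, method="inclusive")
    kept = [t for t in times if q1 <= t <= q3]
    return kept, q1, q3

def is_reliable(q3, time_limit=TIME_LIMIT_S):
    """A scale is discarded when its Q3 reached the time limit."""
    return q3 < time_limit

# Hypothetical completion times (seconds) for one scale, with one capped task.
times = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 60.0]
kept, q1, q3 = interquartile_subset(times)
```

With these hypothetical values, the capped 60 s task falls outside the Q1 to Q3 band and is excluded, while the scale itself stays reliable because its Q3 is below the limit.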

The remaining reliable data are plotted in Fig. 13b. Note that the completion times are dramatically high for small objects, while they are moderately high for large objects. A one-way analysis of variance (ANOVA) test revealed that the differences between the object sizes in terms of completion times are significant \((F = 42.1, p < 0.001)\). For the post hoc analysis, we conducted multiple F-tests between the adjacent object sizes. Once an F-test revealed the inequality of variances, a Welch’s t-test (unequal variance t-test) was conducted. Otherwise, a t-test assuming equal variances was conducted. The results are presented in Table 2, where asterisks indicate significant differences (*\(p < 0.05\), **\(p < 0.01\), ***\(p < 0.001\)). Based on this observation, we selected “square of log” function to fit the data:

$$\begin{aligned} a(\log _{b} (cx))^2 + d \qquad (1) \end{aligned}$$

The fitted graph is overlaid on the data in Fig. 13b. It shows that the argmin in 6DOF is 0.45 m where the minimum completion time is 5.28 s. We take the argmin as the optimal size in 6DOF.
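For \(a > 0\), the squared logarithm in Eq. (1) vanishes only when \(cx = 1\), so the argmin is \(x = 1/c\) and the minimum value is \(d\). The sketch below checks this numerically. Only the argmin (0.45 m) and the minimum (5.28 s) come from the text; the values of \(a\) and \(b\) are hypothetical placeholders because the fitted coefficients are not reported here.

```python
import math

def square_of_log(x, a, b, c, d):
    """Evaluate a * (log_b(c * x))^2 + d."""
    return a * math.log(c * x, b) ** 2 + d

A, B = 10.0, 10.0   # hypothetical: the fitted a and b are not given in the text
C = 1.0 / 0.45      # chosen so that the argmin 1/c equals the reported 0.45 m
D = 5.28            # reported minimum completion time in seconds

# Grid search over sizes from 0.01 m to 10 m in 0.01 m steps.
candidates = [round(0.01 * k, 2) for k in range(1, 1001)]
best_size = min(candidates, key=lambda x: square_of_log(x, A, B, C, D))
best_time = square_of_log(best_size, A, B, C, D)
```

The grid search lands on the analytic argmin regardless of the placeholder values, since \(a\) and \(b\) only change how steeply the curve rises on either side of \(1/c\).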

Fig. 14 Convenient sizes for 6DOF (Z) in the main experiment

6.2.2 Convenient size

In 6DOF (Z), a subject is allowed to scale the virtual environment up and down using zoom to obtain the “convenient size” of the source and target. If a subject finds the given size sufficiently convenient, however, no zoom operation is made. We take the convenient size as the object size at the time when a docking task is completed. It may or may not be a scaled one.

Figure 14a shows the convenient sizes for each object scale. If the given size is small, subjects tended to scale it up. If the given size is large, subjects tended to scale it down.

Figure 14b depicts the statistics of all convenient sizes. We take the data between Q1 (0.17 m) and Q3 (0.47 m) as the range of convenient sizes.

Fig. 15 A docking task completed with Diorama

Note that the optimal size, 0.45 m, discovered in Sect. 6.2.1 lies in this range. Interestingly, it is closer to the upper bound of the range, 0.47 m, than to the lower bound, 0.17 m. This implies that convenience is somewhat distinct from efficiency, which is associated with the optimal size.

In order to verify the optimal size and the range of convenient sizes identified in the main experiment, we developed a proxy-based metaphor, named Diorama, in which a proxy of an object is given “with the optimal size.” Diorama also supports zoom (presented in Sect. 3.2.2). We hypothesized that providing such a proxy would help users complete the docking tasks while minimizing the use of zoom operations.

7 Application and additional experiment

7.1 Diorama—proxy-based manipulation

Figure 15 shows a scenario where a docking task is completed with Diorama. In addition to the green source and silver target, the diorama is presented to the user. As shown in Fig. 15a, it is composed of three handles (named crop-handle, torus-handle and stick-handle) and what we name proxy-box, in which objects are manipulated.

When the user touches the source and then selects it (by pressing both trigger and grip buttons), its wireframe bounding box becomes red and its proxy appears in proxy-box (Fig. 15b). The environment surrounding the selected object is also copied into proxy-box. Observe that part of the silver target is visible at the lower-left corner of proxy-box.

By grabbing and dragging stick-handle, the user translates proxy-box to a more convenient location to work (Fig. 15c). Then, in order to secure a better view to work, the user rotates proxy-box by grabbing and rotating torus-handle (Fig. 15d). While the user moves proxy-box, the proxies are accordingly moved. However, the original source and target are fixed in space and do not move at all.

The user considers proxy-box too small and so scales it up by grabbing and pulling crop-handle (Fig. 15e). By doing so, the target in proxy-box is fully visible to the user. Then, the user starts the docking task inside proxy-box (Fig. 15f). This time, the original source is accordingly manipulated.

In Fig. 15f, the proxies are not located in the middle of proxy-box. Feeling uncomfortable with that, the user performs grab-navigation (presented in Sect. 3.2.1) by grabbing and pulling the empty space of proxy-box so that the proxies are located in the middle (Fig. 15g). Then, in order to be able to manipulate the source more accurately, as shown in Fig. 15h, the user performs zoom by grabbing and stretching the empty space of proxy-box. (Zoom is not available outside proxy-box.) Neither grab-navigation nor zoom alters the original source. Figure 15i shows the state where the docking task is completed.

Suppose that a user judges that the given size of an object is appropriate enough to be manipulated. Then, the user is allowed to manipulate it directly without making a proxy. This implies that ‘diorama’ is not used at all, but the user simply performs 6DOF.

Following the procedure presented in Table 1, the completion times were measured in Diorama. The experiment was conducted one week after the main experiment, with the same thirty-three participants.

Fig. 16 Classification of docking tasks in Diorama

7.2 Results and analysis

As in the main experiment, eleven scales were used. Each scale was given twice, making a subject perform 22 docking tasks. In total, 726 docking tasks were performed by thirty-three subjects. Among them, three tasks were not completed in 60 s. Analysis is made with the remaining 723 tasks. Among them, 634 tasks were performed by using the diorama, whereas 89 tasks were performed without it, i.e., just with 6DOF. See Fig. 16. Among the 634 tasks completed by using the diorama, zoom was not used in 524 tasks. It was used in 110 tasks only. This fits our hypothesis. Providing the optimal-size objects minimized the use of zoom operations.
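The task counts reported above can be cross-checked with a few lines of arithmetic. The variable names are hypothetical; all numbers are taken from the paragraph.

```python
subjects, scales, repetitions = 33, 11, 2

tasks_per_subject = scales * repetitions       # 22 docking tasks per subject
total_tasks = subjects * tasks_per_subject     # 726 tasks in total
completed = total_tasks - 3                    # three tasks exceeded 60 s

with_diorama, without_diorama = 634, 89
diorama_without_zoom, diorama_with_zoom = 524, 110

# Share of diorama tasks that were completed without any zoom operation.
no_zoom_share = diorama_without_zoom / with_diorama
print(f"{no_zoom_share:.1%} of diorama tasks used no zoom")
```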

The mean completion time for “diorama (w/o zoom)” is 12.73 s, whereas it is 18.32 s for “diorama (w/ zoom).” Since the F-test detects unequal variances between them, Welch’s t-test is performed. It reveals a significant difference between them in terms of completion time (\(t = 5.27\), \(p < 0.001\)). This shows that manipulating an optimal-size proxy is more efficient than using zoom.
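The test statistics involved can be sketched with the standard library alone. The sketch computes the variance ratio used by an F-test and the Welch statistic with its Welch-Satterthwaite degrees of freedom; obtaining p-values would additionally require the F and t distributions from a statistics package. The sample data are hypothetical, not the measured completion times.

```python
import math
import statistics

def variance_ratio(a, b):
    """F statistic for comparing two sample variances (larger over smaller)."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return max(va, vb) / min(va, vb)

def welch_t(a, b):
    """Welch's t statistic and its Welch-Satterthwaite degrees of freedom."""
    na, nb = len(a), len(b)
    sa, sb = statistics.variance(a) / na, statistics.variance(b) / nb
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sa + sb)
    df = (sa + sb) ** 2 / (sa ** 2 / (na - 1) + sb ** 2 / (nb - 1))
    return t, df

# Hypothetical samples with clearly unequal variances.
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
f_stat = variance_ratio(group_a, group_b)
t_stat, dof = welch_t(group_a, group_b)
```

Unlike the equal-variance t-test, the Welch statistic does not pool the two variances, which is why it is preferred once the F-test indicates unequal variances.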

Fig. 17 Convenient sizes for “diorama (w/ zoom)”

For the 110 tasks of “diorama (w/ zoom),” we measured the convenient sizes. As depicted in Fig. 17, Q1 of the data is 0.18 m and Q3 is 0.26 m. They fall into the convenient size range, 0.17 to 0.47 m, which is identified in the main experiment and colored in yellow in Fig. 17.

8 Conclusion and future work

This paper reported the results of a series of experiments conducted to investigate the object sizes that are optimal and convenient for mid-air manipulation in immersive virtual environments. We used a well-known mid-air manipulation method, 6DOF, and the Utah teapot, which has many distinct geometric features. Through an additional experiment based on a new proxy-based method, Diorama, the optimal and convenient sizes were verified. We believe that our studies can serve as a guideline for advanced studies on the size problems in mid-air manipulation. If fast and accurate mid-air manipulation is required, for example, a proxy-based method is suggested, where the proxy is given with the optimal size. If speed and accuracy are not critical, however, we would suggest providing the virtual objects within the range of convenient sizes.

However, our work has some limitations. In the current study, a single method (6DOF) and a single object (Utah teapot) were used. There exist other mid-air manipulation methods such as the Handle-bar metaphor [32], the PRISM technique [6, 7] and the ray-casting method [3]. It has also been reported that 3D objects can be classified by their shapes [43]. Therefore, it is worth investigating the optimal and convenient sizes with multiple manipulation methods and a set of classified objects. It would also be worth investigating whether Diorama outperforms other proxy-based methods in which the proxy is not given with the optimal size. In addition, recruiting subjects with more diverse characteristics (in terms of dominant hand, gender, and age) would help generalize the findings. These topics will make up our future research.