1 Introduction

In this study, we propose a sound-image icon: an invisible 3D object that integrates sound-source localization with mid-air tactile sensation. The combined auditory and haptic feedback creates a virtual object with no visual appearance, without requiring the user to wear any devices.

Fig. 1.

Concept of a sound-image icon. The user selects the sound-image icon of the air conditioner to control the temperature. The sound from the icon informs the user of its position, and the icon provides tactile feedback without any visual image.

A typical method for reproducing tactile sensations in mid-air [2, 3, 6, 8] is the airborne ultrasound tactile display (AUTD), which presents moving stimuli on the skin by remotely producing the radiation pressure of focused airborne ultrasound. Previous studies have integrated such mid-air haptic feedback with floating visual images [9, 11]. From the viewpoint of interface design, vision is the most efficient channel for conveying the spatial arrangement of objects to a user, whereas haptics facilitates conveying the user’s intent to a computer system. Therefore, combining a 3D visual interface with mid-air haptics is a reasonable approach to building an efficient interface. However, the eyes are sometimes occupied by a specific task, such as driving a car or performing surgery. In addition, glasses-free 3D displays are still immature and have difficulty achieving a wide viewing angle, while head-mounted displays can cause fatigue and VR sickness. Sound is an alternative means of indicating an object’s position, because humans can instantaneously identify the direction of a sound when it is clearly audible [1]. Moreover, sound can convey words and tones that express various attributes.

Many studies have created virtual sound sources at specific locations under various conditions and environments. Recent work includes virtual sound source positioning for acoustic navigation in unknown spaces [14] and vector base amplitude panning, which creates 2D or 3D sound fields for arbitrary numbers and placements of loudspeakers [13]. Although these technologies form a broad research domain, we found no studies that integrate virtual sound sources with mid-air haptics. It is therefore worth investigating whether auditory and haptic perception can effectively complement each other.

The concept of a sound-image icon is depicted in Fig. 1. The user hears a binaural sound and identifies the direction of its source; we refer to this perceived source as a “sound image.” A sound image can represent various functions, and users can recognize the role of an object by its sound. For example, an air-conditioner icon can describe itself in words, while an audio-volume icon can produce a pleasant musical sound. Users rely on this sound as a cue to reach for the icons. Tactile cues then let them recognize the exact positions of the icons and perform fine tasks; for example, a user who selects the air-conditioner icon can control the temperature by operating it. Haptic feedback is therefore critical not only for improving operability during control but also for reliably guiding the user’s hand to the starting point of the operation and holding it there.

In this study, we prototype sound-image icons and experimentally evaluate their feasibility. Our goal is a system in which a user can select and operate a desired icon among multiple icons. As a first step, we examine whether users can find an icon and measure the time required to reach it. Our results show that combining auditory and tactile sensations enables users to locate the icons accurately and with more consistent search times.

2 Proposed Method

A sound-image icon combines a virtual sound source with a tactile sensation produced by acoustic radiation pressure. This section describes how the sound-image icon is created.

2.1 Producing a Sound Image

The user localizes the sound source by means of binaural sound. We ultimately plan to deliver binaural sound to the ears with ultrasound beams; in this experimental system, however, the binaural sound was presented through an in-ear binaural headset (CS-10EM, Roland). The sound was recorded with the headset’s built-in microphones worn in the ears of one of the authors while a real sound source was placed at the icon position. When this recording is played back to both ears, the listener perceives a 3D sound image at the same position as the real source, assuming that head-related transfer functions are sufficiently similar across listeners [10].
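Ignoring room reflections, the in-ear recording described above captures the source signal filtered by the recording listener’s head-related impulse responses (HRIRs). A common alternative to direct recording, and not the method used in this prototype, is to synthesize the ear signals by convolving a dry mono signal with a measured left/right HRIR pair. The minimal Python sketch below illustrates this; the function names are hypothetical, and real HRIRs would come from a measurement or an HRTF database rather than the toy values used here.

```python
from typing import List, Sequence, Tuple


def convolve(signal: Sequence[float], ir: Sequence[float]) -> List[float]:
    """Direct linear convolution; the output has len(signal) + len(ir) - 1 samples."""
    out = [0.0] * (len(signal) + len(ir) - 1)
    for n, x in enumerate(signal):
        for k, h in enumerate(ir):
            out[n + k] += x * h
    return out


def render_binaural(mono: Sequence[float],
                    hrir_left: Sequence[float],
                    hrir_right: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Left- and right-ear signals for a mono source filtered by one HRIR pair."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy HRIR pair (hypothetical): the right ear receives the sound 3 samples later
# and at half amplitude, a crude stand-in for a source on the listener's left.
left, right = render_binaural([1.0, -1.0], [1.0], [0.0, 0.0, 0.0, 0.5])
```

Here `left` is `[1.0, -1.0]` and `right` is `[0.0, 0.0, 0.0, 0.5, -0.5]`; at a 48 kHz sample rate, the 3-sample lag corresponds to an interaural time difference of 62.5 µs.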

2.2 Aerial Haptic Feedback

An AUTD presents the aerial tactile sensation at the icon location. Users actively search for the ultrasound focus, where acoustic radiation pressure produces the tactile sensation of the virtual icon.

An AUTD is a phased array that generates an ultrasonic focal point at an arbitrary position in air [5, 6]. The acoustic radiation pressure is proportional to the sound energy density at the skin surface [3]. Although an AUTD can produce various pressure patterns by controlling the amplitude and phase of each transducer at a frame rate of \(1\,\mathrm {kHz}\) [4], this prototype creates only a single focus. Users perceive a stimulus around the focal point, and the tactile sensation becomes more vivid when the ultrasound is amplitude-modulated or when the focus is vibrated laterally on the skin [15].
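The single-focus principle can be sketched as follows: each transducer is driven with a phase that compensates for its path length to the focal point, so that all waves arrive in phase there. This is a minimal Python sketch, not the driver software used in the prototype; the 40 kHz carrier, 10 mm element pitch, flat array geometry, and 340 m/s speed of sound are assumptions not stated in this paper, and only the \(14\times 18\) grid and the \(200\,\mathrm {Hz}\) modulation frequency come from the text.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

# Assumed parameters (not given in this paper), except the 14 x 18 grid.
SPEED_OF_SOUND = 340.0   # m/s
FREQUENCY = 40e3         # Hz, carrier frequency
PITCH = 0.01             # m, center-to-center transducer spacing
ROWS, COLS = 14, 18


def transducer_positions(center: Vec3) -> List[Vec3]:
    """Element centers of a flat ROWS x COLS grid lying in the plane z = center[2]."""
    cx, cy, cz = center
    return [(cx + (c - (COLS - 1) / 2) * PITCH,
             cy + (r - (ROWS - 1) / 2) * PITCH,
             cz)
            for r in range(ROWS) for c in range(COLS)]


def focus_phases(elements: List[Vec3], focus: Vec3) -> List[float]:
    """Drive phase (rad, in [0, 2*pi)) per element so all waves arrive in phase at `focus`.

    An element driven by sin(w*t + phi) at distance d from the focus contributes
    sin(w*t - k*d + phi) there (amplitude decay ignored), so phi = k*d cancels
    the propagation delay.
    """
    k = 2 * math.pi * FREQUENCY / SPEED_OF_SOUND  # wavenumber
    return [(k * math.dist(e, focus)) % (2 * math.pi) for e in elements]


def am_envelope(t: float, fm: float = 200.0) -> float:
    """Sinusoidal amplitude-modulation envelope in [0, 1]; fm = 200 Hz as in Sect. 3."""
    return 0.5 * (1.0 - math.cos(2 * math.pi * fm * t))
```

For example, `focus_phases(transducer_positions((0.0, 0.0, 0.0)), (0.0, 0.0, 0.3))` returns 252 phases for a focus \(30\,\mathrm {cm}\) in front of the array, matching the AUTD-to-icon distance in Sect. 3; scaling every element’s amplitude by `am_envelope(t)` gives the modulated drive.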

Fig. 2.

Experimental setup. Five AUTDs were placed \(30\,\mathrm {cm}\) behind the five positions where the sound-image icons were located. All dimensions are in cm.

3 Experiment

We experimentally examined whether people could haptically identify the location of a particular icon by following its sound, and we measured the accuracy of and the time required for this identification.

3.1 Procedure

The experimental setup is depicted in Fig. 2. Five icon positions were used; we examined whether participants could identify the target icon and measured the time required to do so. The icons were placed at \((-40, 20, 20)\), \((-20, 20, 20)\), \((0, 20, 20)\), \((20, 20, 20)\), and \((40, 20, 20)\,\mathrm {cm}\), where the origin is the center of the head, the z-axis points forward, the y-axis is vertical, and the x-axis completes a right-handed coordinate system. The five AUTDs were placed \(30\,\mathrm {cm}\) behind the icon positions. Only the AUTD corresponding to the target icon emitted focused ultrasound, with \(200\,\mathrm {Hz}\) sinusoidal amplitude modulation. Each AUTD unit is an ultrasonic phased array (SSC-HCT1, Shinko Shoji Co., Ltd.) with \(14\times 18\) elements, and the maximum force produced by a single unit is \(10\,\mathrm {mN}\).
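The direction of each icon as seen from the head center follows from the coordinates above. The sketch below computes azimuth, elevation, and distance for the five positions; it assumes that “behind” means farther along \(+z\) (away from the participant), so each AUTD lies at \(z = 50\,\mathrm {cm}\), and because the sign of the vertical axis is not stated, only the magnitude of the elevation is meaningful. The helper names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]

# Icon positions in cm (head-centered frame: z forward, y vertical, x lateral).
ICONS: List[Point] = [(-40, 20, 20), (-20, 20, 20), (0, 20, 20), (20, 20, 20), (40, 20, 20)]
AUTD_OFFSET_CM = 30.0  # assumed: each AUTD sits 30 cm farther along +z than its icon


def direction(p: Point) -> Tuple[float, float, float]:
    """Azimuth (deg, 0 = straight ahead), elevation (deg) and distance (cm) from the head center."""
    x, y, z = p
    azimuth = math.degrees(math.atan2(x, z))
    elevation = math.degrees(math.atan2(y, math.hypot(x, z)))
    return azimuth, elevation, math.dist(p, (0.0, 0.0, 0.0))


def icon_table() -> List[Dict[str, object]]:
    """One row per icon: its direction from the head plus the assumed AUTD position."""
    rows = []
    for p in ICONS:
        az, el, r = direction(p)
        rows.append({"icon_cm": p, "azimuth_deg": az, "elevation_deg": el,
                     "distance_cm": r, "autd_cm": (p[0], p[1], p[2] + AUTD_OFFSET_CM)})
    return rows


def adjacent_azimuth_gaps() -> List[float]:
    """Azimuth difference (deg) between neighboring icons, ordered by x."""
    az = [direction(p)[0] for p in sorted(ICONS)]
    return [b - a for a, b in zip(az, az[1:])]


if __name__ == "__main__":
    for row in icon_table():
        print(row)
```

The azimuths come out as \(0^\circ\), \(\pm 45^\circ\), and about \(\pm 63.4^\circ\), so the outer icons are only about \(18.4^\circ\) in azimuth from their neighbors, whereas the icons adjacent to the center are \(45^\circ\) apart; the elevation magnitude falls from \(45.0^\circ\) for the central icon to about \(24.1^\circ\) for the outer ones.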

As sound sources, we recorded solo performances of five instruments: drum, bass, acoustic guitar, piano, and flute. We chose the drum as the target sound because it was the easiest to localize, and we told the participants that the drum sound belonged to the target sound-image icon. In each trial, several sound sources (up to five), always including the target, were randomly selected and played at randomly assigned positions among the five locations. For tactile feedback, a single ultrasound focus was created only at the target position.
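The trial randomization described above can be sketched as follows. This is a minimal Python illustration, not the authors’ experiment software; the function names and the minimum of two simultaneous sources (`min_sources`) are hypothetical choices, since the text only states “several” sources with at most five.

```python
import random
from typing import Dict

INSTRUMENTS = ["drum", "bass", "acoustic guitar", "piano", "flute"]
TARGET = "drum"
POSITIONS = [1, 2, 3, 4, 5]  # position numbers entered on the keyboard


def make_trial(rng: random.Random, min_sources: int = 2) -> Dict[int, str]:
    """Draw one trial as {position: instrument}.

    The number of sources is drawn uniformly from min_sources..5 (min_sources is
    a hypothetical choice); the drum is always included, and every source gets
    a distinct, randomly chosen position.
    """
    n = rng.randint(min_sources, len(INSTRUMENTS))
    distractors = [s for s in INSTRUMENTS if s != TARGET]
    sources = [TARGET] + rng.sample(distractors, n - 1)
    positions = rng.sample(POSITIONS, n)
    return dict(zip(positions, sources))


def target_position(trial: Dict[int, str]) -> int:
    """Position of the drum, i.e. where the single ultrasound focus is created."""
    return next(pos for pos, src in trial.items() if src == TARGET)
```

For instance, `trial = make_trial(random.Random(1))` yields one randomized layout, and `target_position(trial)` gives both the correct keyboard answer and the AUTD to drive in the auditory-with-tactile condition. Passing a seeded `random.Random` keeps trial sequences reproducible across participants.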

The target-finding experiment was conducted under two conditions: auditory-only and auditory-with-tactile (referred to below as “auditory” and “auditory+tactile”). In the auditory-only condition, only the binaural sound was presented, whereas in the auditory-with-tactile condition, sound and tactile stimulation were presented simultaneously.

We asked the participants to locate the target sound-image icon as quickly as possible after the start cue, i.e., the onset of audio playback. We also instructed them to keep their eyes closed while searching with their hands, to exclude visual cues. After locating the icon, participants entered its position number (1 to 5) on a keyboard. White noise was played to mask the audible noise from driving the AUTDs.

Twelve men in their twenties with no hearing or health problems participated in the experiment.

Fig. 3.

Average of participants’ (a) answer accuracy and (b) required time.

3.2 Results and Evaluations

The results are shown in Figs. 3(a) and (b). As Fig. 3(a) shows, answer accuracy was 55% when the location was indicated by sound alone. In contrast to this high error rate in the “auditory” condition, accuracy in the “auditory+tactile” condition was almost 100%. We used a t-test to examine whether adding tactile sensation to the sound cue significantly improved the correct answer rate. The p-value between “auditory” and “auditory+tactile” was smaller than 0.01, indicating a significant improvement: participants could precisely pinpoint the location of the sound image when they could also search by touch.
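Assuming that the same twelve participants performed both conditions, a paired t-test on per-participant accuracy is one natural form of this comparison. The stdlib-only Python sketch below computes that statistic under this assumption (the function name is hypothetical); the p-value then follows from the t distribution with the returned degrees of freedom, for example via `scipy.stats.ttest_rel`.

```python
import math
import statistics
from typing import Sequence, Tuple


def paired_t(first: Sequence[float], second: Sequence[float]) -> Tuple[float, int]:
    """Paired-samples t statistic for (second - first) and its degrees of freedom.

    Index i must refer to the same participant in both sequences.
    """
    if len(first) != len(second) or len(first) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [b - a for a, b in zip(first, second)]
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

A positive t means the second condition (here, “auditory+tactile”) scored higher on average.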

Fig. 3(b) shows that the average required times for the two conditions were almost the same: \(5.67\,\mathrm {s}\) for “auditory” and \(5.12\,\mathrm {s}\) for “auditory+tactile.” However, the standard deviation was smaller for “auditory+tactile” (approximately \(2.12\,\mathrm {s}\)) than for “auditory” (\(3.20\,\mathrm {s}\)), indicating that presenting both stimuli reduced the variability of the required time. To test these observations, we applied Levene’s test to the variances and Welch’s t-test to the means. Levene’s test yielded a p-value smaller than 0.01, indicating a significant difference between the variances of the required time for the two conditions, whereas Welch’s t-test showed no significant difference between the means (p = 0.121).
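For readers who want to reproduce this kind of analysis, the two test statistics can be computed with Python’s standard library alone, as sketched below. This is an illustration, not the authors’ analysis script: the function names are hypothetical, and it is not stated whether the mean-centered (original Levene) or median-centered (Brown–Forsythe) variant was used, so both are offered. P-values follow from the t and F distributions, e.g. via `scipy.stats.ttest_ind(a, b, equal_var=False)` and `scipy.stats.levene`.

```python
import math
import statistics
from typing import List, Sequence, Tuple


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = statistics.variance(a) / len(a)  # squared standard error of mean(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def levene_w(*groups: Sequence[float], center: str = "mean") -> Tuple[float, int, int]:
    """Levene's W statistic and its (k - 1, N - k) degrees of freedom.

    center="mean" gives Levene's original test; center="median" gives the
    Brown-Forsythe variant, which is the default of scipy.stats.levene.
    """
    locate = statistics.mean if center == "mean" else statistics.median
    # Absolute deviation of each observation from its own group's center.
    z: List[List[float]] = []
    for g in groups:
        c = locate(g)
        z.append([abs(x - c) for x in g])
    k = len(z)
    n = sum(len(g) for g in z)
    group_means = [statistics.mean(g) for g in z]
    grand_mean = statistics.mean([v for g in z for v in g])
    between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(z, group_means))
    within = sum((v - m) ** 2 for g, m in zip(z, group_means) for v in g)
    if within == 0:
        raise ValueError("no spread in the absolute deviations; W is undefined")
    w = (n - k) / (k - 1) * between / within
    return w, k - 1, n - k
```

Applied to the per-trial required times of the two conditions, `levene_w(auditory, auditory_tactile)` tests whether their spreads differ, while `welch_t` compares the means without assuming equal variances.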

4 Discussion

As shown in Fig. 3(a), with sound alone, participants could not always estimate the exact position of the icon and made mistakes. In contrast, the correct answer rate reached almost 100% when tactile feedback was added. This result indicates that sound localization, although instantaneous, was inaccurate and unreliable. Haptic feedback compensated for this drawback, showing that auditory and haptic perception cooperate well: participants grasped the approximate position by hearing the sound and determined the exact position by touching the sound-image icon [7].

The icon position can also be localized by tactile sensation alone. To examine this, we conducted an additional experiment under a tactile-only condition. The participants and procedure were the same as those described in Sect. 3, except that the start cue was a separate monaural sound. In this condition, the localization time was \(5.30 \pm 3.39\,\mathrm {s}\) (mean \(\pm\) SD). Although the mean required time was comparable to that for the “auditory+tactile” condition, the variance was significantly larger.

Before this additional experiment, we expected the required time to be shortest in the “auditory+tactile” condition. However, this hypothesis was not supported. Searching the entire area without prior information was not time-consuming; it took participants only approximately \(2\,\mathrm {s}\) to sweep the area with their hands. Nevertheless, because the standard deviation was smallest in the “auditory+tactile” condition, the combination of sound and haptic feedback made the icon search more consistent.

To avoid confusion, we restate the purpose of combining auditory and tactile sensations. The role of sound is to notify users of the existence and attributes of the icons around them. Tactile sensation is necessary to determine an icon’s exact location and to operate it. Therefore, a short localization time in the tactile-only condition does not mean that the auditory cue is unnecessary.

5 Conclusion and Future Work

In this study, we proposed a sound-image icon and examined the basic feasibility of icon localization. A sound-image icon presents a virtual object through sound and touch without any visual presentation. Our experiment confirmed that participants could locate a single target icon accurately, and with consistent search times, by using their auditory and tactile senses together. In future work, we will investigate how to display multiple icons efficiently.

We used a headset as the sound display device because this was a feasibility study examining the effectiveness of auditory–haptic integration. However, binaural sound can also be produced in a non-contact manner using airborne ultrasound [12]. Performing detailed operations with the sound-image icon was beyond the scope of this paper and is the next important challenge for the sound-image icon.