Abstract
For people with visual impairment, smartphone apps that use computer vision techniques to provide visual information play important roles in supporting daily life. However, such apps can be used only under a specific condition: when the user knows where the object of interest is. In this paper, we first point out this fact by categorizing the tasks that obtain visual information using computer vision techniques. Then, taking looking for something as a representative task of one category, we discuss suitable camera systems and rotation navigation methods; for the latter, we propose novel voice navigation methods. In a user study with seven people with visual impairment, we found that (1) a camera with a wide field of view, such as an omnidirectional camera, was preferred, and (2) users have different preferences in navigation methods.
This work was supported by JSPS Kakenhi Grant Number 17H01803.
1 Introduction
For people with visual impairment, the lack of access to visual information can cause difficulty in daily life and decrease independence. To mitigate this, smartphone apps that convey visual information to the user have been developed. VizWiz [5] and Be My Eyes [4] enable people with visual impairment to ask remote sighted workers or volunteers for support. EnVision AI [6], TapTapSee [12], and Seeing AI [11] use computer vision techniques [8] to obtain visual information. At the time of writing, these apps, except VizWiz, are used by many people with visual impairment and play important roles.
This paper focuses on the latter approach, i.e., apps that use computer vision techniques. Although it has not been argued before, these apps can be used only under a specific condition: when the user can photograph the object of interest by him- or herself. Let us confirm this. To take a photo of an object, the user has to know where it is; at the same time, the purpose of using such an app is to learn what it is. Hence, these apps are usable only when “what (it is)” is unknown and “where (it is)” is known. Extending this idea, we find that the following three types of visual information exist, as summarized in Table 1.
- Category (i): what is unknown and where is known.
In this category, the user can photograph the object of interest by him- or herself. This type of visual information can be obtained with current smartphone apps that use computer vision techniques, such as [6, 11, 12].
- Category (ii): what is known and where is unknown.
A representative task of this category is looking for something: the user knows what he or she is looking for but not where it is. Because the location is unknown, the user cannot use the current smartphone apps in the same way as in category (i); the user would need to move the smartphone around to capture the object. Hence, a camera with a wide field of view (FoV), such as a fisheye or omnidirectional camera, is expected to work better. Since the user already knows what the object is, unlike in category (i), the app only needs to tell where it is once found.
- Category (iii): both what and where are unknown.
In this category, the user does not expect the app to provide any particular visual information; however, if provided, the information is expected to be valuable. Conceptually, this is similar to the recommendation systems used on e-commerce websites such as Amazon.com, which introduce products that are potentially interesting and unexpected to the user. Thus, a representative task is finding something valuable and unexpected. In a real-world scenario, the app should obtain as much visual information from all around the user as possible; hence, similar to category (ii), a camera with a wide FoV is expected to work better. A big difference from the other categories is that the amount of visual information the app can provide may be large: the app may find multiple valuable objects simultaneously. However, too much information is simply annoying, so the amount of information provided to the user must be controlled.
Among these, we focus on category (ii) and study looking for something, a representative task of the category, with respect to the following two issues.
The first issue concerns cameras. In the task, we assume the user looks for a designated object using an app that applies a computer vision technique to detect the object and then guides the user to it. As the system needs to capture the object with the camera, the task is expected to become easier with a wide-FoV camera, such as a fisheye or omnidirectional camera. In a user study, we investigate whether this expectation is correct.
The second issue concerns rotation navigation methods. In turn-by-turn navigation, Ahmetovic et al. [3] studied rotation errors and found that participants tend to over-rotate turns by, on average, 17° more than instructed. They concluded that simply notifying the user upon reaching the target orientation, as they did in that study, is error prone, and that a different interaction, such as continuous feedback, is required. As a follow-up, Ahmetovic et al. [1] investigated three sonification techniques that provide continuous guidance during rotation. However, instructions need not be given by sound alone. Hence, we introduce three voice instructions and investigate the users’ preferences in the user study.
2 Method
2.1 Prototype System
For the looking-for-something task, we implemented a computer-vision-based prototype system that guides the user to the target object in a step-by-step manner.
- Step 1: Object detection. The system detects an object of the designated category in the captured image. In the user study, we designated easy-to-detect object categories of which only one instance existed in the room, such as a laptop or a bottle. Once the object detection method outputs the bounding box of the target object, the direction from the user to the object is recorded.
- Step 2: Rotation navigation. The user rotates on the spot until the target object comes in front. By comparing the direction output by the electronic compass with the recorded direction of the target object, the system guides the user’s rotation using a rotation navigation method.
- Step 3: Forward navigation. Guided by the system, the user advances toward the target object and stops in front of it. The system uses the depth camera to measure the distance to the target object and speaks it periodically, like “1.5 m, 1.3 m, ...” Navigation ends when the user comes within 0.8 m of the object.
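The three steps above can be sketched as one decision per sensor update. The following is a minimal sketch under our own assumptions: the function and parameter names are hypothetical, and the thresholds (a 15° front window and a 0.8 m stop distance) are taken from the descriptions in this paper.

```python
def clamp_angle(deg):
    """Wrap an angle in degrees to the range [-180, 180);
    a negative result means the target is on the user's left."""
    return (deg + 180.0) % 360.0 - 180.0

def guidance_tick(target_heading, compass_heading, distance_m,
                  front_deg=15.0, stop_m=0.8):
    """One update of the step-by-step guidance: given the recorded
    target direction (Step 1), the current compass heading (Step 2),
    and the depth-camera distance (Step 3), return the utterance."""
    rel = clamp_angle(target_heading - compass_heading)
    if abs(rel) > front_deg:             # Step 2: rotate toward the target
        return "Right" if rel > 0 else "Left"
    if distance_m > stop_m:              # Step 3: advance, announcing distance
        return f"{distance_m:.1f} m"
    return "The object exists near you."  # reached within 0.8 m
```

In a real system this function would be called on every compass or depth-camera update, with the rotation feedback replaced by whichever navigation method the user selected.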
The implemented prototype system consisted of a laptop computer (MacBook Pro) and one of the two camera systems shown in Fig. 1. The first, shown in Fig. 1(a), consisted of an omnidirectional camera (Ricoh Theta Z1) used in Step 1 of the above procedure, an electronic compass (Freescale MAG3110 installed on a BBC micro:bit), and a depth camera (Intel RealSense D435). The electronic compass was used in Step 2 to quickly sense the user’s direction and promptly give feedback, and the depth camera was used in Step 3 to measure the distance to the target object. The second was a pseudo smartphone, shown in Fig. 1(b); instead of a smartphone’s embedded camera, we used a web camera (Logicool HD Webcam C615) in Step 1, with the same electronic compass and depth camera for a fair comparison. To detect the target object, we ran a PyTorch implementation [13] of You Only Look Once (YOLO) version 3 [10], a representative object detection method, on the laptop. It was trained on the COCO dataset [9], which contains 80 object categories. As the object detection method assumes a perspective image as input, the image captured with the omnidirectional camera was converted into eight perspective images in the same manner as [7]. The prototype system speaks its current state, e.g., “Searching an object. Please stay and wait.”, “Detected.”, “Measuring the distance.”, and “The object exists near you.”
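As noted above, the equirectangular image from the omnidirectional camera must be converted into perspective views before detection. The following sketch illustrates one common way to do this via nearest-neighbor gnomonic projection; it is our own illustrative reconstruction, not the exact procedure of [7], and the function name and default values are assumptions.

```python
import numpy as np

def perspective_views(equirect, n_views=8, fov_deg=90.0, out_size=416):
    """Split an equirectangular image (H x W x 3) into n_views perspective
    images at evenly spaced yaw angles, so that a standard detector such
    as YOLOv3 can be run on each view (nearest-neighbor resampling)."""
    h, w = equirect.shape[:2]
    f = (out_size / 2.0) / np.tan(np.radians(fov_deg) / 2.0)  # focal length
    # Output pixel grid centred on the optical axis.
    u, v = np.meshgrid(np.arange(out_size) - out_size / 2.0,
                       np.arange(out_size) - out_size / 2.0)
    views = []
    for k in range(n_views):
        yaw = 2.0 * np.pi * k / n_views
        # Ray direction of each output pixel after rotating by the view yaw.
        x = np.sin(yaw) * f + np.cos(yaw) * u
        z = np.cos(yaw) * f - np.sin(yaw) * u
        y = v
        lon = np.arctan2(x, z)                  # longitude in [-pi, pi]
        lat = np.arctan2(y, np.hypot(x, z))     # latitude in [-pi/2, pi/2]
        # Map spherical coordinates back to equirectangular pixel indices.
        px = ((lon / np.pi + 1.0) / 2.0 * (w - 1)).astype(int)
        py = ((lat / (np.pi / 2.0) + 1.0) / 2.0 * (h - 1)).astype(int)
        views.append(equirect[py, px])
    return views
```

Each returned view can then be passed to the detector independently; the yaw of the view containing the detected bounding box gives the direction to the target object.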
2.2 Existing Rotation Navigation
Ahmetovic et al. [1] introduced the following three sonification techniques that provide continuous guidance during rotation.
- Intermittent sound (IS) triggers impulsive “beeping” sounds at a variable rate that is inversely proportional to the angular distance, like a Geiger–Müller counter.
- Amplitude modulation (AM) employs a sinusoidal sound modulated in amplitude by a low-frequency (sub-audio) sinusoidal signal. The frequency of the modulating signal is inversely proportional to the angular distance, producing a slowly pulsing sound at large angular distances that becomes stationary when the target is reached.
- Musical scale (MS) plays eight ascending notes at fixed angular distances while approaching the target angle.
They concluded that IS and MS when combined with Ping (impulsive sound feedback emitted when the target angle is reached) were the best with regard to rotation error and rotation time.
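To make the notion of a “rate inversely proportional to the angular distance” concrete, the mapping can be sketched as follows. This is our own illustrative reconstruction, not the implementation of [1]; the endpoint rates are those of the IB method described in Sect. 2.3, and interpolating linearly in 1/angle between the two endpoints is an assumption.

```python
def beep_rate_hz(angle_deg, front_deg=15.0, back_deg=180.0,
                 front_hz=5.0, back_hz=1.2):
    """Beep rate that decreases as the angular distance to the target grows.
    The rate varies linearly in 1/angle (an assumed interpolation),
    hitting front_hz at front_deg and back_hz at back_deg."""
    a = min(max(abs(angle_deg), front_deg), back_deg)  # clamp the angle
    t = (1.0 / a - 1.0 / back_deg) / (1.0 / front_deg - 1.0 / back_deg)
    return back_hz + (front_hz - back_hz) * t
```

The same shape of mapping can drive AM by interpreting the result as the modulation frequency rather than a beep rate.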
2.3 Proposed Rotation Navigation
We examine the following five (three voice and two sound) navigation methods.
- Left or Right (LR) repeatedly (approximately 1.5 times per second) tells the direction toward the target object, i.e., “Left” or “Right.” When the target object comes within 15° in front of the user, it says “In front of you.”
- Angle (AG) repeatedly tells the relative rotation angle to the target object, followed by “Left” or “Right.” The front of the user is always regarded as 0°. For example, if the target object is at an angular distance of 60° on the user’s right-hand side, the system speaks “60°, right.” After the user rotates by 15°, it speaks “45°, right.” In front of the target object (within 15°), it says “In front of you.”
- Clock Position (CP) is similar to AG but uses clock positions. Taking the same example as for AG, it speaks “2 o’clock.” In front of the target object (within 15°), it says “In front of you.”
- Intermittent Beep (IB) is similar to IS of [1]. It triggers impulsive “beeping” sounds at a variable rate that is inversely proportional to the angular distance; the rates at the front (15°) and back (180°) were approximately 5 Hz and 1.2 Hz, respectively. IB is designed for earphones: beeps are played on only the left or right earphone to indicate the rotation direction. When the target object comes within 15° in front of the user, beep sounds are played at approximately 8 Hz on both earphones.
- Pitch (PT) plays sounds with a variable pitch. In our implementation, the front and back pitches were 1570 Hz and 785 Hz (six and three times the frequency of C4 in scientific pitch notation), respectively. In contrast to MS of [1], which plays eight discrete notes, PT varies the pitch continuously. As with IB, PT plays sounds on only the left or right earphone to indicate the rotation direction, and it behaves in the same manner as IB in front of the target object.
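The three voice methods differ only in how a relative angle is verbalized. The following is a minimal sketch of the utterance generation; the function names are hypothetical, and spelling out “degrees” for AG (rather than the “°” symbol) is an assumption made for speech synthesis.

```python
def clock_position(angle_deg):
    """Relative angle (positive = right of the user) to a clock position,
    with 12 o'clock straight ahead; e.g. +60 degrees -> "2 o'clock"."""
    hour = round((angle_deg % 360.0) / 30.0) % 12
    return f"{12 if hour == 0 else hour} o'clock"

def utterance(method, angle_deg, front_deg=15.0):
    """Utterance for the LR, AG, and CP voice navigation methods,
    given the relative angle to the target (positive = right-hand side)."""
    if abs(angle_deg) <= front_deg:
        return "In front of you."
    side = "right" if angle_deg > 0 else "left"
    if method == "LR":
        return side.capitalize()
    if method == "AG":
        return f"{abs(round(angle_deg))} degrees, {side}"
    if method == "CP":
        return clock_position(angle_deg)
    raise ValueError(f"unknown method: {method}")
```

For the 60°-to-the-right example above, AG yields “60 degrees, right” and CP yields “2 o'clock”, matching the behavior described for the two methods.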
3 User Study
We performed a user study with seven people with visual impairment. As summarized in Table 2, the participants consisted of four males and three females, aged 23 to 48. Six were totally blind, and one had low vision. The user study consisted of the following four parts.
1. Instruction of the experiments
We told the participants that our research topic was looking for something and gave a brief overview of the experiments.
2. Pre-study interview: Survey on looking for something
We asked the participants about looking for something. This interview was conducted in pairs, except for participant A; the interview groups were [A], [B and C], [D and E], and [F and G]. The questions and answers are summarized in Table 3. Answers to Q7 and Q9 are shown in Tables 4 and 5.
The participants’ answers are summarized as follows. Five of the seven participants lived with someone (Q1); among them, two lived with sighted persons (Q2). While all of them looked for something every day (Q3), they did not encounter trouble every day (Q4). All looked for things at home, and three also did so elsewhere (office or school, and outside) (Q5). All of them groped around, expecting to find the item within arm’s reach, while four asked a sighted person if one was available (Q6). Five most often looked for a smartphone, while earphones and other items were also frequently sought (Q7). The time required to find lost items varied, and some gave up if it took more than five minutes (Q8). Lost items were found in the pockets of jackets and bags, and on chairs and tables (Q9). Losing items was mostly caused by misremembering or forgetting where they were placed (Q10). All answered that their remedy for losing things was to keep items in fixed places, while two also kept their rooms tidy (Q11).
3. Experiment 1: Comparison of five rotation navigation methods
Unlike the pre-study interview, the following two experiments were performed individually for each participant. In this experiment, we asked participants to use the five rotation navigation methods one by one through Steps 1 (object detection using the omnidirectional camera) and 2 (rotation navigation) of Sect. 2.1. As IB and PT were designed for earphones, participants used earphones for all navigation methods for a fair comparison. Figure 2 shows how the experiment was performed. Table 6 shows the participants’ preferences on a 5-point scale, in which a larger number means better. Their comments on the five navigation methods and their ideas for easy-to-use navigation methods are shown in Tables 7, 8, 9, 10, 11 and 12.
Table 6 shows that the participants’ preferences varied: every navigation method except LR was selected as the best by at least one participant. Related results have been reported in two papers: musical experience affects users’ behavior [1], and expertise affects interaction preferences in navigation assistance [2]. While we did not ask about expertise in our experiment, the participants’ comments (see Footnote 1) show that each has a different compatibility with the navigation methods. These findings imply that no single method is best for everyone and that personalization of user interfaces is vital. We also asked the participants whether they hesitate to wear earphones on both ears: one (D) did not hesitate, four (A, C, F, and G) did not if at home, and two (B and E) did.
4. Experiment 2: Selection of camera
We asked participants to use each of the two camera systems and complete the three-step finding process of Sect. 2.1. Each participant used the navigation method he or she rated best in Experiment 1 but was free to use earphones or not. Table 13 shows that the omnidirectional camera was preferred by six participants and the pseudo smartphone by one. Tables 14 and 15 show the participants’ comments on the camera systems. Six (all but C) commented on the difficulty of using the pseudo smartphone for looking for something. In contrast, all of them, including participant C, who preferred the pseudo smartphone, found advantages in the omnidirectional camera, while three (A, C, and F) commented on its heaviness. Hence, we conclude that the omnidirectional camera has advantages in this task.
4 Conclusions
In this paper, we focused on apps that use computer vision techniques to provide visual information. We pointed out that current smartphone apps can be used only under a specific condition and categorized the tasks of obtaining visual information into three types. As a representative task of one category, we focused on looking for something. For this task, we built a prototype system that used an omnidirectional camera and proposed the use of voice in rotation navigation. A user study with seven people with visual impairment confirmed that (1) a camera with a wide FoV is better for this task and (2) users have different preferences in rotation navigation. The latter implies that no single method is best for everyone and that it is vital to personalize user interfaces.
Notes
1. Let us highlight some comments. Participant B on CP: he preferred to place the target, rather than himself, at the 12 o’clock position. Participants C, E, and G: C and E were not good at AG but good at CP, while G was the opposite. Participant A on IB: expecting a 3D audio effect, upon hearing the sound played on only the left or right, he felt the target object was to his side.
References
Ahmetovic, D., et al.: Sonification of rotation instructions to support navigation of people with visual impairment. In: Proceedings of the PerCom (2019)
Ahmetovic, D., Guerreiro, J., Ohn-Bar, E., Kitani, K.M., Asakawa, C.: Impact of expertise on interaction preferences for navigation assistance of visually impaired individuals. In: Proceedings of the W4A (2019)
Ahmetovic, D., Oh, U., Mascetti, S., Asakawa, C.: Turn right: analysis of rotation errors in turn-by-turn navigation for individuals with visual impairments. In: Proceedings of the ASSETS (2018)
Bigham, J.P., et al.: VizWiz: nearly real-time answers to visual questions. In: Proceedings of the UIST (2010)
Iwamura, M., Hirabayashi, N., Cheng, Z., Minatani, K., Kise, K.: VisPhoto: photography for people with visual impairment as post-production of omni-directional camera image. In: Proceedings of the CHI Extended Abstracts (2020)
Leo, M., Medioni, G., Trivedi, M., Kanade, T., Farinella, G.: Computer vision for assistive technologies. Comput. Vis. Image Underst. 154, 1–15 (2017)
Lin, T.Y., et al.: Microsoft COCO: common objects in context. arXiv preprint arXiv:1405.0312 (2014)
Redmon, J., Farhadi, A.: YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767 (2018)
© 2020 The Author(s)
Iwamura, M., Inoue, Y., Minatani, K., Kise, K. (2020). Suitable Camera and Rotation Navigation for People with Visual Impairment on Looking for Something Using Object Detection Technique. In: Miesenberger, K., Manduchi, R., Covarrubias Rodriguez, M., Peňáz, P. (eds) Computers Helping People with Special Needs. ICCHP 2020. Lecture Notes in Computer Science(), vol 12376. Springer, Cham. https://doi.org/10.1007/978-3-030-58796-3_57