Recent trends in computer vision-driven scene understanding for VI/blind users: a systematic mapping

During the past years, the development of assistive technologies for visually impaired (VI)/blind people has helped address various challenges in their lives by providing services such as obstacle detection, indoor/outdoor navigation, scene description, text reading and facial recognition. This systematic mapping review focuses mainly on the scene understanding aspect (e.g., object recognition and obstacle detection) of assistive solutions. It provides guidance for researchers in this field to understand the advances made during the last four and a half years. This period is of particular interest because deep learning techniques, together with computer vision, have become markedly more powerful and accurate in tasks like object detection. These advancements can bring a radical change in the development of high-quality assistive technologies for VI/blind users. Additionally, an overview of the current challenges and a comparison between different solutions is provided to indicate the pros and cons of existing approaches.


Introduction
According to the study published by the World Health Organization (WHO), there are approximately 285 million visually impaired (VI) people around the world of which 39 million are completely blind [1]. Many of these people have difficulty handling some of their daily tasks. These tasks include navigating in an environment, knowing about obstacles in their path and identifying objects around them. During the past years, many researchers have been working on assistive solutions which use technologies like RGB-D cameras, ultrasonic sensors, optical beacons, LiDAR, RFID tags or WiFi access points, to ease the tasks of everyday life for the VI/blind.
Additionally, during the past few years, the use of computer vision and deep learning in assistive technologies for VI/blind people has noticeably increased in popularity among researchers. These technologies have opened up new possibilities for making more effective and useful assistive tools. In this systematic mapping, a study of 105 recent papers is conducted to give an overview of the state-of-the-art computer vision-based assistive solutions. Moreover, these papers are compared to identify their advantages and disadvantages and to propose possible improvements.

Related work
This section provides a brief discussion of relevant literature reviews, surveys and research on assistive technologies for the VI/blind. To the best of our knowledge, there are not many thorough literature reviews or systematic mappings about this topic. Nonetheless, there are some articles that to some extent review existing state-of-the-art assistive tools. For example, in one of the most recently published papers, Aileni et al. [2] presented several assistive technologies for people with low visual acuity. They discussed how these technologies could be integrated into wearable systems and transportation systems using the Internet of Mobility (IoM) and Internet of Mobile Things (IoMT). The review is about assistive solutions that are based on vision, audio and tactile senses. However, only a limited number (9 solutions) of computer vision-based solutions are mentioned in the paper. The solutions they reviewed mostly utilize computer vision and speech recognition to bring an audio-based augmented reality experience to the users. Orcam or Horus [3], for instance, are assistive tools that have a camera and voice control to help users detect faces, recognize objects and read texts. An assistive app called Seeing AI [4] was also mentioned. It has functionalities similar to the aforementioned solutions, but uses the power of smartphones to help VI/blind users. Aira [5] is a different kind of assistive solution discussed in the review: it provides live access to human agents who can see through the camera of the blind user's smartphone and provide different kinds of assistance, like scene description or text reading.
Chanana et al. [6] conducted a systematic review of assistive technologies for pedestrians with visual impairment that were on the market in 2017. Their work is mainly focused on systems that use laser, infrared or ultrasonic sensors for obstacle detection. The solutions mentioned in the paper are obstacle detection devices that mount on a white cane. These devices can detect obstacles that are not at ground level and warn the user using low and high pitch sounds or haptic feedback on their body. Most of these solutions are limited to detecting obstacles at a distance of up to 1 or 2 meters. Only 2 of the 9 assistive devices mentioned in the paper were commercially available at the time of our review. One obstacle detection system mentioned was EYECane [7], which uses computer vision for assistive purposes, but it was at the prototype stage and could not be applied to a real-world environment at the time the paper was published. Two more recent types of assistive canes are discussed in [8]: WeWalk [9], which can be connected to smartphones via Bluetooth and uses ultrasonic sensors for obstacle detection, and assistive canes that use radio frequency identification (RFID) tags, like [10]. However, due to the necessity for RFID tags, the use of RFID-based canes is confined to small or indoor environments.
Kau and Garg [11] published a systematic review in which they briefly discuss research on camera-based and sensor-based assistive devices. They point out some of the critical challenges of these solutions, such as low-light detection, prediction of dynamic obstacles, the large size of the prototypes and the high cost of building them. However, their paper lacks an in-depth description of the technologies and does not consider the availability of user evaluations in the studies.
Kuriakose et al. [12] reviewed different modalities for navigation of VI/blind people in different environments. They assessed multimodal systems and their benefits in comparison with unimodal approaches. They also discussed the pros and cons of different modalities like tactile, visual, aural and haptic in navigation systems for VI/blind users. In another review by the same authors about tools for navigation of VI/blind users [13], they made several recommendations to improve these solutions: robust real-time object detection, availability of multiple options for feedback, reduced learning time for the user, portability, privacy, avoidance of social stigma and not overloading the user with information.
In a similar study by Romlay et al. [14], the maturity level of various electronic travel aids (ETA) in 70 studies is discussed. However, only one (Guido [15], a smart walker for the blind) of the products mentioned was successful in preproduction unit sales in the USA. Most of the other assistive solutions in their review were tested in a simulated environment without real subjects or constructed hardware.
Finally, Khan and Khusro [16] conducted a review about smartphone-based assistive solutions for VI/blind. They discussed the challenges and issues of different kinds of solutions like vision substitution-based, sensor-based, sonar-based and augmented reality-based interventions. The discussed issues are mostly about the usability of the solutions, for example, inconsistency in interface elements, difficulty of text entry or modification, device incompatibility, illogical order of items and inadequate mapping of feedback with UIs, etc. Additionally, the importance of user-centered design (UCD) for the development of assistive solutions is highlighted, because in this approach the system design begins with the explicit understanding of the user needs, and the user is involved in the process of development and evaluation.
Existing reviews have demonstrated that the availability of assistive solutions for VI/blind users in the market is limited. Moreover, the currently available solutions have noticeable problems from the usability point of view.
In our research, we concentrate on aspects of assistive solutions for the VI/blind that have not been thoroughly investigated before. Our focus is on the recent improvements in computer vision methods for scene understanding and the impact they have on assistive solutions for the blind. We also examine the existence of user evaluations for the proposed solutions, as we consider this a critical component in the development of assistive solutions.

Systematic mapping study
Following Khan and Khusro [16], we wanted our research to focus on user needs and requirements. Ntakolia et al. [8] provide a list of user requirements after a thorough literature review on studies about requirements elicitation for assistive solutions for VI/blind people. The requirements considered in our research are listed in Appendix 5. Obtaining information from the surroundings is considered one of the most important requirements for assistive solutions for VI/blind users, because it helps them to create a more accurate mental map of the environment [17]. Therefore, the capability of computer vision technologies to analyze and understand a scene has been considered as the core of our mapping study.
The method used for this systematic mapping study is proposed by Petersen et al. [18]. A systematic mapping study's goal is to obtain a detailed overview of a research subject, provide a review of existing literature, identify research gaps and gather evidence for possible research directions [19]. Consequently, the goal of this research is to gather relevant publications that are related to computer vision-driven scene understanding for VI/blind users, and present an overview of the status quo to find the research gaps.
A number of research questions were defined to specify the objective as follows: 1. What are the current computer vision solutions for scene understanding? 2. How are computer vision methods used to assist blind users with their daily activities? 3. How have proposed solutions been evaluated?
After defining the research questions, the process of collecting papers began.

Search strategy
The major terms used for performing the paper search were "Computer vision," "Visual Impairment" and "Accessibility." The papers were then collected by combining these three terms and using similar keywords. The keywords used in the search are listed in Table 1. The scope of the research covered papers published after January of 2017 in journals, academic conferences, workshops and academic books. Web pages, non-academic publications and patents were excluded from the research scope. The search strategy is outlined in Table 2. Papers published after 2017 were chosen because computer vision methods for object detection have improved drastically since that year and brought significant improvements in scene understanding research for the VI/blind. Figure 1 shows the summary of the selection process. The initial number of search results retrieved from all the databases was very large (around 27,000 for papers published after 2017). In order to select suitable papers for the research, the first 50 papers, ordered by relevance and publishing date (most recent first), were analyzed according to the exclusion criteria. If more than 10 papers among those 50 were related to the topic, the next 50 papers were also analyzed. This process continued until fewer than 10 papers related to the research goals were found. In the application of the exclusion criteria (Table 3), the abstracts, introductions and conclusions of the papers were considered as the main source. In some cases, other parts of the papers were also read to obtain a better comprehension. After removing the duplicates (992 results) and applying the exclusion criteria, 180 papers were left for full text reading, out of which 105 were ultimately useful for the review.
The reason for excluding a large number of papers was that we were focused on the studies that provided a useful and tangible solution. This means that we skipped the papers which were about frameworks or solutions that missed implementation (e.g., prototype, proof of concept (POC) or simulation).
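The batch-wise screening rule described above can be sketched in a few lines of Python. This is only an illustration of the stopping logic, not the tooling used in the mapping: the function name, the `is_relevant` callback and the treatment of a batch with exactly 10 relevant papers (here: stop) are assumptions.

```python
from typing import Callable, List, Sequence


def screen_in_batches(
    ranked_papers: Sequence[str],
    is_relevant: Callable[[str], bool],
    batch_size: int = 50,
    min_hits: int = 10,
) -> List[str]:
    """Screen ranked search results one batch at a time.

    Every visited batch is screened in full. The next batch is visited only
    if MORE than `min_hits` papers of the current batch were relevant
    (assumed boundary handling: exactly `min_hits` stops the search).
    """
    selected: List[str] = []
    for start in range(0, len(ranked_papers), batch_size):
        batch = ranked_papers[start:start + batch_size]
        hits = [paper for paper in batch if is_relevant(paper)]
        selected.extend(hits)
        if len(hits) <= min_hits:
            break  # too few relevant papers: stop screening this result list
    return selected
```

With such a rule, the number of screened papers per database depends on how quickly the relevance of the ranked results drops off, which keeps the manual effort bounded even when the raw result count is very large.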

Data extraction
For the papers that were chosen to be read completely, we needed a template for data extraction so that the comparison and tracking of the information in the papers would be easier. According to the research questions, a number of categories were defined to compare existing solutions. The three major categories are "Scene Understanding," "Assistance services" and "Evaluation." "Scene Understanding" is related to the first research question and is focused on the level of the perception that the system has from the surrounding environment of the user. The "Assistance services" category is related to the second research question, and defines how the understanding of the outside world by the system is going to assist VI/blind users. Finally, the "Evaluation" category collects information about the way in which the solution was evaluated.
In Table 4 the main categories and subcategories are listed.

Scene understanding
Object recognition This is one of the most important features of the assistive solutions for scene understanding. There are different methods for recognizing an object using computer vision and each of them has its advantages and disadvantages. In this category, we check the availability of object recognition in each solution.
Obstacle detection Obstacle detection is another primary feature that must be included in order to warn users and avoid obstacles in the environment. In this category, we analyze the approach of each solution for detecting obstacles, either using sensors and/or cameras.
Depth detection This is one of the most challenging aspects in the development of assistive solutions for scene understanding. Estimating in real time the 3D location of objects in the physical world using computer vision is a difficult task.
Table 3 Exclusion criteria
1. Papers that are not relevant to the primary research goals. This is obtained after reading the abstract, introduction and conclusion of the papers.
3. Papers published in non-academic web pages, personal blogs or patents
4. Papers without providing a solution for obstacle detection and/or object recognition
5. Solutions that miss implementation
6. Not written in English
Hardware used This category focuses on the hardware components that were used in each solution. Different researchers made use of various kinds of devices depending on the budget and purpose of their project.

Assistance services
Type of assistance In this category, we summarize how each solution provides assistance for the user and in which kind of tasks they can help them.
Modality This category considers the methods used in the system for interacting with the users. For example, whether they provide text to speech, a pitch sound, speech recognition, haptic feedback, etc.

Evaluation
Technical evaluation This kind of evaluation is focused on analyzing the solutions from an objective point of view in order to test the performance of the algorithms used. The metrics used for evaluation and the environment in which the evaluation was performed are reviewed.
User testing Testing a system with end users is essential. In this category, we describe if and how the user testing of each solution was performed.

Mapping results
The process of gathering data from papers was based on the categories defined in Table 4. Figure 2 shows the distribution of papers published in each year. The number of papers published in 2020 that meet our research goal is almost four times higher than that of 2017. We did not include the number of relevant papers published in 2021 in Fig. 2 because our mapping did not cover the whole year. The steady increase in the number of published papers indicates that the topic has been getting more attention during the last years. This is mainly because of the improvements in deep learning algorithms, computer vision cloud services and mobile devices.

Scene understanding
Scene understanding for the visually impaired/blind differs in some respects from the classical approach to scene understanding. It is very important that the system analyzes and perceives the environment at a level that is beneficial for VI/blind users. For instance, it is crucial that the algorithms are fast and accurate enough in detecting/recognizing obstacles and objects to give prompt feedback to the user when necessary. Additionally, semantic understanding of the environment by the system and finding the relations between different objects in a scene are important to give a comprehensible description of the environment to a user who has no access to visual cues.
After comparing the various research approaches, we came to the conclusion that in the early years, most of the solutions were focused on obstacle detection or image enhancement techniques. Image processing was used in order to make the images perceivable for the visually impaired people. For instance, in [20][21][22] researchers used techniques like contrast enhancement, image mapping and magnification. Their aim was to increase the visibility of the important features of an image (e.g., edges). However, these methods had some limitations. For example, the algorithms added too much noise to the image or amplified the contrast of some parts of the image that were not necessary for scene understanding. Lately, object recognition, which is a more efficient method for scene understanding, has become more popular thanks to technological advancements (Fig. 3). The percentage of solutions with object recognition has been increasing in the last years, as shown in Fig. 4.

Obstacle detection
For detecting obstacles in the environment, there are usually two different approaches: distance sensors (e.g., ultrasonic, LiDAR and infrared (IR) triangulation) or camera-based (e.g., monocular or RGB-D cameras) techniques. In some cases, the combination of both techniques is used for better accuracy. Detecting the exact distance of an object/obstacle from the user is one of the complications of creating assistive solutions for blind people. This is because the 3D location of an object in the real world is often inferred from a 2D image taken by a monocular camera. Ultrasonic sensors, point clouds, RGB-D images (taken by stereo vision cameras) or mathematical estimations are the solutions that have been used for tackling this problem. In this section, different approaches are discussed.
Camera-based techniques: Researchers in [23,24] applied Stixel-World [25], a method that is mostly used in autonomous cars, to help VI/blind people navigate in an environment. The Stixels algorithm provides environmental awareness based on the depth images provided by an RGB-D camera. RGB-D images are captured using cameras that work in association with sensors for distance detection. Stixels segment the objects in the image around the user into vertical regions according to their depth disparity in the environment. Afterward, using object recognition techniques, Stixels semantically categorize objects in the scene.
On the other hand, researchers in [26] used Apple's ARKit 2 [27] framework to find the 3D location of obstacles using planes and point clouds. ARKit 2 can detect the vertical and horizontal planes. Therefore, it can differentiate the ground from the other planes that could be potential obstacles. Additionally, point clouds (which are a set of points that represent the objects' salient contours in an image) can be used to detect obstacles that have more complicated shapes. These high level computer vision features of augmented reality enable the possibility of converting 2D points provided by phone cameras to 3D position in order to estimate the approximate position of the obstacles.
It is also possible to mathematically calculate the distance of an object in front of the camera. Lin et al. [28] used a method proposed by Davison et al. [29] for object distance estimation (Equation 1).
In this formula, f is the camera focal length, k_v is the pixel density (pixel/meter), h is the camera height from the ground, v_0 is the center coordinate of the formed image, v is the distance from the camera to the target object's ground coordinates, and Z is the distance between the target object and camera location. The authors of [28] report that this method has a high accuracy in distance detection and can estimate an object's distance beyond 10 meters.
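As an illustration of how these symbols can be combined, the sketch below uses the flat-ground pinhole relation Z = f · k_v · h / (v − v_0). This particular form is an assumption made here for illustration and may differ from the exact expression used in [28,29]; in it, v is interpreted as the image row (in pixels) at which the object touches the ground, and the function name is hypothetical.

```python
def ground_plane_distance(f: float, k_v: float, h: float,
                          v: float, v_0: float) -> float:
    """Estimate the distance Z (meters) to an object standing on flat ground.

    Assumed relation (not confirmed by the source): Z = f * k_v * h / (v - v_0)
      f    focal length in meters
      k_v  pixel density in pixels per meter
      h    camera height above the ground in meters
      v    image row of the object's ground contact point, in pixels
      v_0  image row of the image center, in pixels
    """
    offset = v - v_0
    if offset <= 0:
        # At or above the image center row the ground ray never meets the floor.
        raise ValueError("ground contact point must lie below the image center row")
    return f * k_v * h / offset
```

For example, with f = 0.004 m, k_v = 250000 pixel/meter and h = 1.5 m, a ground contact point 150 pixels below the image center corresponds to a distance of about 10 m under this model. The estimate becomes very sensitive to pixel errors as v approaches v_0, which is why distant objects are harder to range this way.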
The solution in [30] is based on another mathematical method that uses depth images and fuzzy control logic for the approximate measurement of obstacles' distance. This solution divides the frame into three parts (right, left and center), categorizes the location of obstacles in three different categories and provides audio feedback for the user according to them. If the user faces any obstacles, the system makes decisions in order to avoid them based on 18 different fuzzy navigation rules that depend on the location and distance of the obstacles.
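The region-based idea can be illustrated with a simplified sketch. It is not the fuzzy controller of [30]: the 18 fuzzy rules are replaced here by a single crisp threshold, and the function names, the 1.5 m safety distance and the convention that a depth of 0 marks an invalid pixel are all assumptions.

```python
from typing import Dict, List, Optional


def nearest_per_region(depth_rows: List[List[float]]) -> Dict[str, Optional[float]]:
    """Return the nearest valid depth (meters) in the left, center and right thirds."""
    width = len(depth_rows[0])
    bounds = {
        "left": (0, width // 3),
        "center": (width // 3, 2 * width // 3),
        "right": (2 * width // 3, width),
    }
    nearest: Dict[str, Optional[float]] = {}
    for name, (lo, hi) in bounds.items():
        # Depth 0 is treated as "no measurement" (assumed sensor convention).
        values = [d for row in depth_rows for d in row[lo:hi] if d > 0]
        nearest[name] = min(values) if values else None
    return nearest


def suggest_direction(nearest: Dict[str, Optional[float]],
                      safe_distance: float = 1.5) -> str:
    """Crisp stand-in for the fuzzy rules: pick a direction with enough clearance."""
    def clearance(name: str) -> float:
        value = nearest[name]
        return float("inf") if value is None else value

    if clearance("center") >= safe_distance:
        return "forward"
    best_side = max(("left", "right"), key=clearance)
    return best_side if clearance(best_side) >= safe_distance else "stop"
```

A fuzzy controller would replace the hard threshold with overlapping membership functions (e.g., near, medium, far) so that the feedback changes gradually rather than abruptly as the user approaches an obstacle.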
Besides the mentioned techniques, it is also possible to estimate the depth from monocular images using deep learning techniques. Recently, this approach has received more attention due to the rapid advancements of deep learning methods. Facil et al. [31] built a depth prediction network that provides a depth map from a single RGB image. Their predictions work with images taken with different camera models. Various kinds of neural networks (NNs), like CNNs [32] and RNNs [33], have been implemented, showing the effectiveness of monocular depth estimation. In [34], a CNN is used for calculating the distance of the obstacles. According to the authors' comparisons, their method is more accurate than some devices like Kinect. We did not come across many assistive solutions that make use of these techniques, but they are likely to be applied more in the future development of assistive solutions for the blind, since the approach is cost-effective and at the same time robust.
Distance sensor-based techniques: Using sensors has been a more common approach for obstacle detection in comparison with camera-based techniques. Among the different kinds of sensors, ultrasonic sensors are very popular. This is because of their accuracy, low cost, low power consumption and ease of use [35].
Ultrasonic sensors have a transmitter that generates sound waves with a frequency that is too high for human ears to hear. The receiver of the sensor then waits for the echo of the sound and, based on the elapsed time, the distance to the obstacle is calculated. These sensors are better at detecting transparent objects compared to light-based sensors or radars. As an example, Bharatia et al. [36] used ultrasonic sensors for the detection of knee-level and low-lying obstacles. One of the drawbacks of using these sensors is that they cover a short range and can only detect close obstacles, which makes them more suitable for indoor environments.
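The time-of-flight calculation behind this kind of sensor can be written as a short sketch. The speed of sound used here (about 343 m/s in dry air at 20 °C) and the function name are assumptions for illustration; real devices may compensate for temperature.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate value for dry air at 20 degrees Celsius


def echo_distance_m(round_trip_s: float,
                    speed_m_s: float = SPEED_OF_SOUND_M_S) -> float:
    """Convert an ultrasonic echo round-trip time (seconds) to a distance (meters).

    The pulse travels to the obstacle and back, so the one-way distance is
    half of the total path length.
    """
    if round_trip_s < 0:
        raise ValueError("round-trip time cannot be negative")
    return speed_m_s * round_trip_s / 2.0
```

Under these assumptions, an echo that returns after 10 milliseconds corresponds to an obstacle roughly 1.7 m away.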
Combined distance sensor and camera-based techniques: Lately some researchers have combined these two approaches. Hakim and Fadhil [35] made use of ultrasonic sensors and RGB-D cameras for obstacle detection. Their electronic travel aid (ETA) processes the data received from an RGB-D camera using a Raspberry Pi 3 B+, which has ultrasonic sensors attached to it for distance detection. The combination of these two approaches provides more accurate obstacle detection.
Appendix 1 contains the detail of the different obstacle detection techniques that were used in the reviewed papers.

Object recognition
During the past years, the use of deep neural networks (DNN) for object detection, especially the latest CNN models such as ResNet [37] or GoogLeNet [38], has considerably extended the potential of computer vision for developing assistive solutions for the VI/blind. These models have a notably superior performance that makes real-time object recognition more achievable in comparison with shallower networks such as AlexNet [39]. In addition, CNNs can learn high-level semantic features from the input data automatically, optimize multiple tasks simultaneously, such as bounding box regression and image classification, and solve some of the classical challenges of computer vision [40].
There are different ways to implement an object recognition solution. One common approach is to execute the process of recognition remotely on cloud services like Google Cloud Vision, Microsoft Azure Computer Vision, Amazon Rekognition [41], etc. These services are already trained with huge datasets that enable improved performance. Bharatia et al. [36] proposed a mobile system that uses Google Cloud Vision to recognize objects, texts and faces. There are also solutions that provide their own cloud computing algorithms for image processing. For instance, researchers in [42] made a remote object detector using an improved version of ResNet [43] network.
These services mostly have a Representational State Transfer (REST) Application Programming Interface (API) to handle communication between the client and the server. Companies that provide these services usually calculate the costs based on the number of requests sent to the server by the client.
Local image processing is another approach used in many solutions, in which the computations related to object recognition are carried out on the client side. Nevertheless, this approach is usually confined to a limited number of objects due to hardware limitations. Dosi et al. [44] made an Android app for VI people that locally recognizes objects. They used MobileNets [45], a neural network for mobile and embedded vision applications, for object recognition. Single Shot Detector (SSD) [46] with the MobileNets architecture is another popular algorithm, which is used in a considerable number of papers for object recognition and can bring fast and efficient results. SSD can detect multiple objects in an image in a single forward pass of the network. Visual Geometry Group (VGG16) [47], which is a convolutional neural network model, is the base network of the SSD algorithm, followed by multi-scale feature layers for object category and bounding box predictions. SSD generates anchor boxes in various sizes and predicts objects based on their size. Larger objects are detected by the deeper feature layers and smaller ones are detected by the shallower layers. The inference time of the SSD512 method is 22 milliseconds with about 76.8% mean Average Precision (mAP) on the Pascal VOC2007 dataset of images, which shows its competence and swiftness in object detection [46]. YOLO [48] is a CNN-based object detection technique that uses Darknet, an open source neural network framework written in C and CUDA. YOLO divides an image into an S×S grid and generates B bounding boxes for each grid cell. Afterward, it predicts the probability of classes for objects and their corresponding bounding boxes. In [48], YOLO VGG-16 has an inference time of 47 milliseconds with 66.4% mAP on the Pascal VOC2007 dataset of images, which is close to the SSD performance.
Different versions (YOLO v3 (Tiny) [49], YOLO v2 [50], YOLO 9000 [51]) were used in different studies. The accuracy and the number of objects that can be detected vary for each version. For instance, Tiny YOLO is made for mobile devices and is able to recognize a lower number of objects compared with the other versions. YOLO and SSD are very popular because they achieve a balance between accuracy and speed. Figure 5 shows the distribution of object recognition algorithms in the reviewed solutions. As shown in the figure, YOLO is the most popular solution for object recognition. The remaining "Other Neural Networks" algorithms used in the solutions are various local object recognition methods that use Inception-v3 [52], stochastic gradient descent (SGD) algorithms on Keras [43,53], OpenCV [54] functions [55] or the Computer Vision System Toolbox of MATLAB [56,57], to name a few. Appendix 2 contains the object detection methods used in the reviewed solutions.
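The grid-based formulation of the original YOLO detector can be illustrated with a small sketch. It only shows how an object center is mapped to the grid cell responsible for predicting it and how many raw boxes one forward pass produces; the function names are hypothetical, and the defaults S = 7 and B = 2 are the values reported for the first YOLO version on Pascal VOC.

```python
from typing import Tuple


def responsible_cell(cx: float, cy: float, img_w: float, img_h: float,
                     S: int = 7) -> Tuple[int, int]:
    """Return (row, col) of the grid cell that contains the object center.

    In the original YOLO formulation, this cell is the one responsible for
    predicting the object.
    """
    col = min(int(cx / img_w * S), S - 1)  # clamp centers on the right border
    row = min(int(cy / img_h * S), S - 1)  # clamp centers on the bottom border
    return row, col


def raw_box_count(S: int = 7, B: int = 2) -> int:
    """Number of bounding boxes predicted per image before any filtering."""
    return S * S * B
```

With the default values, the detector outputs 98 candidate boxes per image, which are then filtered by confidence and overlap. Later versions changed the grid resolution and introduced anchor boxes, so these numbers do not carry over to them.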
It is important to mention that the quality of the input images sent to these algorithms can affect their performance noticeably. For example, images taken at night or in low light conditions can have a high noise level or distortion that can reduce the accuracy of object detection algorithms. To overcome this problem, there are different methods. For instance, [35,58,59] used a Gaussian filter that blurs the image in order to remove the noise and unnecessary details in the images before sending them to the algorithm. Researchers in [42] used the stereo-image quality assessment (SIQA) approach to collect images with higher quality before sending them as input to the system. They used a disparity map and the traditional approach of Hough transform for image evaluation. IQA evaluation methods are divided into two main categories, deep learning methods such as CNNs [60], and traditional feature extraction methods that help detecting different parts of the desired objects in an image [61]. These two methods are used to make a score prediction for the image quality assessment.
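To make the denoising step concrete, the following dependency-free sketch applies a separable Gaussian blur to a grayscale image stored as a list of rows. It is a minimal illustration rather than the pre-processing code of [35,58,59]; the kernel radius, the sigma value and the border handling (edge replication) are assumptions, and a production system would more likely call an optimized library routine.

```python
import math
from typing import List


def gaussian_kernel(radius: int, sigma: float) -> List[float]:
    """Build a normalized 1D Gaussian kernel with 2 * radius + 1 taps."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def _blur_line(line: List[float], kernel: List[float]) -> List[float]:
    """Convolve one row or column with the kernel, replicating border pixels."""
    radius = len(kernel) // 2
    last = len(line) - 1
    blurred = []
    for i in range(len(line)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)  # clamp index at the borders
            acc += w * line[j]
        blurred.append(acc)
    return blurred


def gaussian_blur(image: List[List[float]], radius: int = 1,
                  sigma: float = 1.0) -> List[List[float]]:
    """Blur a grayscale image with a horizontal pass followed by a vertical pass."""
    kernel = gaussian_kernel(radius, sigma)
    rows = [_blur_line(row, kernel) for row in image]
    cols = [_blur_line(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]
```

An isolated bright pixel, which is typical of sensor noise in low light, is spread over its neighbors and loses most of its amplitude, while large uniform regions remain unchanged. The trade-off is that fine details such as thin edges are also softened, so the kernel size has to be chosen with the detector in mind.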

Assistance services
It is crucial to consider how the information regarding the environment obtained by sensors, cameras and algorithms is transferred to the user. The assistance provided by these solutions should be swift, accurate and easily understandable for VI/blind people. The assistive solutions reviewed in this paper help users in different tasks relating to their daily life.
The modality of these assistive solutions is mostly based on audio and tactile feedback. Researchers in [50,51,62,63] used binaural audio for scene description. This brings a sense of audio-based augmented reality to the user. Users can hear and feel the approximate 3D location of the object/obstacle based on the audio they hear using normal/bone conduction headphones. In these solutions, the device/app names all the objects in the scene or a specific object that the user is looking for. Moreover, some solutions provide vibrotactile feedback. Saurav and Niketa Gandhi [64] used two servo motors for vibrating feedback so that when there is an obstacle on the left, the left side of the user vibrates and when there is an obstacle on the right, the right side vibrates. In another research [26], the user is warned about the obstacles with an audio beep. The pitch of the sound changes based on the size and proximity of the object.
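The feedback mappings described above can be sketched as follows. This is a hypothetical illustration, not the implementation of [26] or [64]: the linear distance-to-pitch mapping, the frequency range, the 5 m maximum range and the function names are all assumptions, and the effect of object size on the pitch is left out.

```python
def beep_frequency_hz(distance_m: float, max_range_m: float = 5.0,
                      low_hz: float = 300.0, high_hz: float = 1500.0) -> float:
    """Map obstacle distance to a beep pitch: the closer the obstacle, the higher the tone."""
    clamped = min(max(distance_m, 0.0), max_range_m)
    closeness = 1.0 - clamped / max_range_m  # 0.0 at max range, 1.0 at contact
    return low_hz + closeness * (high_hz - low_hz)


def vibration_side(obstacle_x: float, frame_width: float) -> str:
    """Choose which side should vibrate from the obstacle's horizontal image position."""
    return "left" if obstacle_x < frame_width / 2 else "right"
```

Under these assumptions, an obstacle at 2.5 m produces a 900 Hz tone, and an obstacle detected in the left half of the camera frame triggers the left actuator. Whether a linear mapping is the most intuitive one for users is an open design question that would need to be settled through user testing.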
There are solutions that carry out the scene description for a specific purpose. These solutions might be limited to certain tasks but the overall performance is better because the scope is constrained. Researchers in [43] detect stop lights and crosswalks to help users with crossing streets.
Many solutions provide navigation assistance for users. For instance, [62] provides a navigation service that guides the user through a pre-scanned environment. A virtual assistant repeatedly states "follow me" and, based on the intensity and the direction of the sound in the headset, the user navigates through the environment.
Besides the scene understanding feature, emergency calls are provided in some solutions which can be very useful for the users. Suresh et al. [65] implemented a speech recognition module that can get orders from users to make an emergency call to predefined users or the closest emergency center based on the location of the user. GPS was used for tracking the live location of the user. Researchers in [36,64,66] undertook a similar approach by placing a call button on the assistance device.
Arakeri et al. [67] proposed a text recognition module in their solution that reads the text that is in front of the user. They used Google Vision API [68] for this purpose. The same functionality has been provided in [69] using Tesseract OCR library [70] and in [71] using Microsoft's Computer Vision API [72]. However, the usage of this feature can be confusing for the user since it is not possible for a VI/blind user to distinguish the exact position of a text in the real environment.
Rahman et al. [73] include a face detection module in their solution in addition to object recognition. They used Multi-task Cascaded Convolutional Networks (MTCNN), which can detect faces with an accuracy of 70-100%.
The kinds of assistance provided in the reviewed solutions are listed in Appendix 3.

Context of use: the ICF framework
To come up with novel and useful solutions for any kind of disability, it is necessary to understand the contexts in which they could be applied and the scope of the limitations that a disabled person faces. The International Classification of Functioning, Disability and Health (ICF) framework [74] provides a standard language for defining different kinds of disabilities. In the framework, limitations of disabled people in their activities are divided into different categories. In our research, we tried to assess the usefulness of the proposed solutions to VI/blind users by mapping their contributions to the different tasks in this framework. In the case of scene understanding for VI/blind people, the main relevant categories are as follows:
• Mobility: This is mainly about moving the body and going from one position to another. According to the framework, it includes tasks like "walking and moving," "changing and maintaining body position" and "moving and handling objects."
We analyzed the compatibility of the above mentioned ICF categories with the solutions provided by different researchers. Mobility is the most explored category, with research in [62,75,76] helping VI/blind people navigate in different environments (indoor/outdoor). These works combine various methods like GPS, obstacle detection and object detection to help the VI/blind in tasks like "walking and moving" from one location to another. Moreover, [57] defines two user stories that are compatible with the "Domestic life" and "Self-care" categories. In the first user story, the blind user receives assistance for detecting the right kind of pasta at home, which helps her with cooking and eating. In the second user story, a man wants to buy a specific kind of biscuits and, using the assistive device, he can find them in the store. Their solution provides feedback about the differences between objects that have the same tactile appearance.
By comparing the categories of the ICF framework with the kinds of assistance provided in the reviewed solutions, we noticed that their scope is generally not well defined in terms of the context of use. For instance, in papers like [50,65,75] object detection could potentially cover some tasks in "Self-care" or "Domestic life." However, these specific use cases are not mentioned. Furthermore, researchers in [73,77,78] provide face detection, a service that could be related to the "Interpersonal interactions and relationships" category, but the final purpose is not clear. It appears that the researchers have focused their efforts more on proving the technical feasibility and performance of the solutions than on demonstrating how the solutions can help VI/blind people in their daily life. This becomes more evident when we analyze the way in which the solutions have been evaluated.

Evaluation
Based on our review, the evaluation process of an assistive solution should tackle two main aspects. One is the technical evaluation and validity assessment of the system from a technical point of view, and the other is the testing of the system with the target end users to evaluate the performance and usefulness of the solutions.
Despite the fact that technical evaluation is a matter of importance, user testing is equally essential. This is because the ultimate goal of any assistive solution is to be useful for VI/blind people. Unfortunately, user testing is neglected in a considerable number of papers. Figure 6 shows the percentage of papers that only undertook the technical evaluation, and the ones that performed both.

Technical evaluation
In the development process of any system, it is crucial to test and measure its performance with objective metrics. In the case of assistive solutions, it is essential to measure the accuracy and efficiency of algorithms in the detection and recognition of obstacles and objects. It is common to evaluate the performance of object detection algorithms using precision, recall and mean average precision (mAP), that is, the mean over object classes of the average precision computed from the precision and recall metrics.

Fig. 6 Evaluation types
A model with high precision returns more correctly predicted results than irrelevant ones and a high recall means that the model returns most of the relevant results. In other words, precision is a measure for quality, while recall is a measure for quantity. Some papers, such as [59] and [26], measured the accuracy of their approach based on recall and precision. Others, like [43,76,79], used mAP to measure the effectiveness and accuracy of the algorithm they used.
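The relation between these metrics can be made concrete with a short Python sketch. It assumes that detections have already been matched to ground-truth objects (for example with an IoU threshold, which is not shown), and it uses all-point interpolation of the precision-recall curve, which is one common convention and not necessarily the one used in the cited papers. All function names are illustrative.

```python
def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(scored_hits: list[tuple[float, bool]], num_ground_truth: int) -> float:
    """Area under the interpolated precision-recall curve for one class.

    scored_hits holds (confidence, is_true_positive) for every detection of the class.
    """
    if num_ground_truth == 0:
        return 0.0
    ranked = sorted(scored_hits, key=lambda item: item[0], reverse=True)
    true_positives = 0
    precisions, recalls = [], []
    for rank, (_, is_hit) in enumerate(ranked, start=1):
        true_positives += int(is_hit)
        precisions.append(true_positives / rank)
        recalls.append(true_positives / num_ground_truth)
    # Make precision non-increasing from right to left (all-point interpolation).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum the rectangles under the interpolated curve.
    ap, previous_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += (r - previous_recall) * p
        previous_recall = r
    return ap


def mean_average_precision(per_class: dict[str, tuple[list[tuple[float, bool]], int]]) -> float:
    """mAP: the mean of the per-class average precision values."""
    aps = [average_precision(hits, n_gt) for hits, n_gt in per_class.values()]
    return sum(aps) / len(aps) if aps else 0.0


if __name__ == "__main__":
    print(precision_recall(tp=8, fp=2, fn=4))
    hits = [(0.9, True), (0.8, False), (0.7, True)]
    print(average_precision(hits, num_ground_truth=2))
```

For an assistive system, the two errors are not equally costly: a missed obstacle (low recall) is a safety risk, whereas a false alarm (low precision) mainly adds to information overload.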
Additionally, other works [80,81,82] tried to compare their solution with other state-of-the-art solutions based on the accuracy, number of detected objects, kind of recognition (object, text, face, obstacle, etc.), average distance, convenience and so on, in order to assess the functionality of their solution and ascertain the competence of their approach.
In technical evaluation, researchers usually test their system by simulating a user scenario. For instance, Wang et al. [24] used a prerecorded video to test their Stixels model.

User testing
The most important aspect in the evaluation of a solution for VI/blind people is to test it with the target users. This is because assistive solutions are for a target group that is different from average users. A person without the disability cannot evaluate the solution properly, given that, due to the variation in sensory input, they do not possess the same mental models of the environment and qualities of embodied experience as people who genuinely have a visual impairment. Sadly, this critical point is neglected in a noticeable number of research projects. Up to 27.5% of the papers included in this review tested their solutions with blindfolded/sighted users. Figure 7 summarizes the visual perception status of the testers in evaluations.
Nonetheless, some works report testing with VI/blind users. Wang et al. [83] performed 100 hours of testing for the navigation module of their solution with simple and complex paths. Afterward, they asked the 5 blind testers to fill in questionnaires regarding the comfort and effectiveness of the system. In another study [26], the solution was tested with 13 VI/blind participants. The researchers conducted 15-minute test sessions and asked users to fill in a Likert-like scale questionnaire. Furthermore, [84] reports a detailed evaluation and found a significant difference between early blind and late blind testers, with the former group performing better. The authors also concluded that vibrotactile cues are less efficient than auditory cues for detection in the central region of the environment.
The research of Guerreiro et al. [85], which presents a navigation robot for the VI/blind, also includes a satisfactory user study. They evaluated their solution with 10 blind participants. The tasks for testing are well designed and explained in the paper. After the testing process, user feedback was obtained about confidence, safety and trust in the solution using questionnaires.
The results obtained from questionnaires are subjective and contain personal opinions of the users. In some cases, personal opinions can be considered unreliable when they are obtained from blindfolded users instead of blind/VI users. For example, in [23] researchers used the NASA-TLX [86] evaluation questionnaire, which analyzes tasks based on Mental Demands, Physical Demands, Temporal Demands, Own Performance, Effort and Frustration. They noticed that users reported unexpectedly low scores on physical demand. This is because they were blindfolded and the tasks appeared to be more frustrating for someone with normal vision in comparison with a VI/blind user.
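To show how such questionnaire data are typically summarized, the sketch below computes an unweighted ("Raw TLX") workload score as the mean of the six subscale ratings, plus per-subscale means across participants. This is a commonly used simplification; the original NASA-TLX procedure additionally weights the subscales through pairwise comparisons, and the source does not state which variant [23] applied. The 0-100 rating range and all identifiers are assumptions made for illustration.

```python
from statistics import mean

# The six NASA-TLX dimensions named in the text (identifier spelling is illustrative).
TLX_SUBSCALES = (
    "mental_demand",
    "physical_demand",
    "temporal_demand",
    "performance",
    "effort",
    "frustration",
)


def raw_tlx(ratings: dict[str, float]) -> float:
    """Unweighted workload score: the mean of the six subscale ratings (assumed 0-100)."""
    missing = [name for name in TLX_SUBSCALES if name not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    for name in TLX_SUBSCALES:
        if not 0 <= ratings[name] <= 100:
            raise ValueError(f"{name} must be between 0 and 100")
    return mean(ratings[name] for name in TLX_SUBSCALES)


def subscale_means(participants: list[dict[str, float]]) -> dict[str, float]:
    """Average each subscale over all participants to compare dimensions or groups."""
    return {name: mean(p[name] for p in participants) for name in TLX_SUBSCALES}


if __name__ == "__main__":
    participant = {
        "mental_demand": 60, "physical_demand": 30, "temporal_demand": 50,
        "performance": 40, "effort": 70, "frustration": 80,
    }
    print(raw_tlx(participant))
```

Reporting the per-subscale means separately for blindfolded and VI/blind testers, rather than one pooled score, makes discrepancies such as the one observed in [23] visible.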
The main methods used for testing the solutions were as follows:
• Surveys: Asking the users a series of questions, usually with Likert-type scales, to obtain their feedback;
• Think-aloud protocol: Users share their opinion about the solution while performing the test tasks;
• Controlled environment testing: Testing the solution in a laboratory environment that was designed by the researchers and observing the user's behavior to detect the advantages and problems of the prototype;
• Field experiments: Testing the solution in real-world settings with the subjects. In this type of testing, users might make unexpected decisions, which helps to uncover scenarios that were not considered by the researchers;
• Remote usability testing: Users are not directly observed while using the assistive solution. Data are gathered and later analyzed by the researchers;
• Interviews: Users are interviewed to share their opinion about the experience of using the solution and its pros and cons.
Each of these methods has its drawbacks. According to [87], representative surveys cost money and time; the think-aloud protocol is not accurate because the environment is not natural to the user and the tasks are usually performed in a controlled environment; field experiments may not represent the correct population; remote testing needs additional tools for collecting data; and interviews do not sufficiently cover usability issues. Additionally, controlled environment testing might not consider some factors that exist in the real environment and may affect the user's experience. Appendix 4 details how user testing was performed in the selected papers. The papers that are not included there performed only a technical evaluation.

Conclusions and discussion
In this systematic mapping study, a selection of papers published during the past four and a half years on computer vision-driven scene understanding for VI/blind people was reviewed. They provide various assistive services for scene understanding, like obstacle detection and object recognition. Obstacle detection can be performed using sensors, cameras or both. Sensors detect obstacles at shorter distances than cameras, but they are easier to use and can perform better in some cases. Therefore, some researchers prefer to combine these two approaches.
For object recognition, there are two main ways to process the data: some solutions perform the computation remotely and others locally. Cloud services and remote servers can perform heavy computations and recognize a wider range of objects. However, they require a constant connection to the network that manages the computing. This can also cause security problems that should be taken into consideration. Local computing does not have those problems, but it offers limited capabilities.
We should highlight that technologies in this field have improved a lot in the past few years. Computer vision is thriving with the appearance of object recognition methods like YOLO, and more advanced solutions have been developed recently that provide higher accuracy and performance.
However, although these solutions help users overcome social barriers and give them more independence in life, computer vision models rely on camera input, which could threaten the privacy of users and surrounding people. One of the greatest risks is that the collected data are misused, especially in solutions that rely on remote servers for computation instead of the users' devices. Some studies show that there is a trade-off between the provided services and the privacy costs, and that some users are willing to accept the privacy costs in exchange for the service they receive [88]. Lee et al. [89] conducted a study about the social acceptance of assistive solutions for the blind from the perspective of both blind users and bystanders. They concluded that a considerable number of people are still not comfortable being exposed to these devices, especially if they include a camera. Their results indicate that a thorough evaluation in a real environment is needed to assess the social acceptance of assistive solutions and the needs of people who are exposed to the technology. The study of Akter et al. [90] shares a similar point of view. Both sighted and nonsighted users in their study were concerned about their privacy and the accuracy of the information provided by the assistive solution.
Another important issue that came up in some of the reviewed solutions is the exclusion of VI/blind people in the evaluation process, an aspect that may be critical for the adoption of the technology.
Verza et al. [91] suggest that some of the reasons for abandoning assistive technologies are neglecting users' opinions in the development process, inefficiency of devices and insufficient training of the user. Furthermore, a study by Phillips and Zhao [92] identified four factors related to the abandonment of assistive devices. They noted that changes in user needs and priorities over time are one of the main factors in device abandonment. For instance, according to [93], some changes in VI users, like a worsening eye condition due to macular degeneration, can imply a significant change in user needs. Other abandonment reasons mentioned include not considering users' opinion, ease of device procurement and poor device performance.
Additionally, testing of the systems should be performed in a real-world scenario to assess if they are usable out of the laboratory's controlled environment. The final goal of an assistive solution is to improve the lives of disabled end users. However, if the usability or performance of the system is not properly tested in the real context of use, it can end up causing negative impacts on the target end user. Our study suggests that researchers in this field should pay more attention to the VI/blind user needs and the applicability of the solutions. Many of the papers reviewed in our systematic mapping had insufficient/no data regarding the target context of use for the proposed solutions. Consequently, it was not possible to assess the compatibility level of their solutions with the ICF categories and their specific tasks, such as dressing, eating, drinking, preparing meals, acquisition of necessities and so on. There are solutions that prove to be able to detect persons, objects or obstacles, but their expected benefits in the end users' life are vague. This situation raises important generalizability concerns, given that a solution just tested in a toy or simulated scenario might not have the same effectiveness in different use cases or scenarios. The cost and effort associated with the adaptation of a specific solution for its application in a different context is generally overlooked.
Finally, the way that information is represented and delivered to the user is crucial. Users should be able to comprehend the information provided by the solution without complications. In most of the solutions, binaural audio, vibrotactile feedback and basic audio instructions were used for this purpose. However, these might not be the best approaches. For example, binaural audio can approximately indicate the location of an object or an obstacle, but still it is not very accurate.
All of the issues mentioned above may have contributed to the abandonment of existing commercial assistive solutions, leading to their unavailability on the market. Paying more attention to the ICF framework, user requirement studies and the UCD methodology would be desirable for developing more usable, useful and better accepted assistive solutions. Besides that, learning from the success stories of products like WeWalk [9], which was designed by Kursat Ceylan, a visually impaired person who was familiar with the needs of blind people, can lead us to the development of more useful assistive devices.
This study highlighted the main challenges in this rapidly evolving field and compared different approaches of computer vision-based assistive solutions in recent years. The results of this systematic mapping can be valuable for researchers who want to develop more useful and effective assistive solutions for VI/blind people in the future.

Requirements
Real-time performance for detection/recognition tasks
5. Ease of use, natural/intuitive user interface, acceptable by a broad user population, including senior citizens
6. A simple training procedure, potentially scalable to new objects and personalization
7. Tolerance to viewpoint variations
8. Tolerance to illumination variations
9. Tolerance to blur, motion blur, out of focus, and occlusions
10. Providing information on location, guidance and navigation
11. Providing information on the system operation
12. Selection of the device operation mode offline-online
13. Personalization by giving the ability to the users to define their own level of disability
14. Minimize the dangers and errors by preventing consequences of incidental or unintentional activity
15. Sharing information for accompanying contents of surroundings (coffee shops, hotels, hospitals, etc.)
16. A camera for detecting obstacles and also for obstacle avoidance (moving and static objects/obstacles' shape, location, moving speed, etc.)
17. Recognizing the color of clothes
18. An affordable price
19. Early alert for obstacles, especially at waist level
20. Position restore actions when the user gets lost
21. Notification of uneven floor surfaces such as loose street tiles, puddles or other small holes
22. Systems should reliably provide relevant information when needed, while also considering information accuracy
23. The types of obstacles that are communicated to the user should be restricted to those that are unexpected. This is especially important to limit information overload and reduce system complexity
24. Different contexts may require different types of user interaction. Environments with many obstacles may require different types of notifications (i.e., more frequent, closer in range)
25. A better detection of horizontal objects, ground and small objects as well
26. Smaller and more efficient device
27. Obstacle recognition after detection (information in the output that would allow the user to distinguish between different types of obstacles)
28. A strong enough vibration signal to indicate an imminent collision
29. Detection of moving obstacles and small objects
30. Accurate voice and language recognition
31. Short sentences to be used as input configuration commands to the assistive mobile application
32. Combination of audio and touch model: audio navigation commands and vibration alerts for early obstacle warning (2 m distance) with increasing frequency
33. Directions to avoid obstacles with vibration signals or audio guidance
34. Body wearable product (the preferred position is the waist)
35. System run locally on the device without the need for internet connection
36. Light product with small size
37. A clear description of the indoor place to create a mental map of it
38. Triggering/Sharpening the visually impaired user's environmental sensing
39. System should cooperate with a human helper who guides the visually impaired

Appendix 1: Techniques used for obstacle detection
Acknowledgements Special thanks to Gofore Spain SL for funding this research. This work was carried out in the context of an industrial PhD, which is a collaboration between Universidad Politécnica de Madrid and Gofore Spain SL.
Funding Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature.

Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.