Multimedia Tools and Applications, Volume 76, Issue 9, pp 11771–11807

A computer vision-based perception system for visually impaired

Abstract

In this paper, we introduce a novel computer vision-based perception system dedicated to the autonomous navigation of visually impaired people. A first feature concerns the real-time detection and recognition of obstacles and moving objects present in potentially cluttered urban scenes. To this purpose, a motion-based, real-time object detection and classification method is proposed. The method requires no a priori information about the obstacle type, size, position or location. In order to enhance the navigation/positioning capabilities offered by traditional GPS-based approaches, which are often unreliable in urban environments, a building/landmark recognition approach is also proposed. Finally, for the specific case of indoor applications, the system has the possibility to learn a set of user-defined objects of interest. Here, multi-object identification and tracking is applied in order to help the user localize such objects of interest. The feedback is presented to the user as audio warnings/alerts/indications. Bone conduction headphones are employed in order to allow the visually impaired to hear the system's warnings without obstructing the sounds from the environment. At the hardware level, the system is fully integrated on an Android smartphone, which makes it easy to wear, non-invasive and low-cost.

Keywords

Obstacle detection · BoVW / VLAD image representation · Relevant interest points · A-HOG descriptor · Visually impaired people

1 Introduction

Recent statistics of the World Health Organization (WHO) [51] have shown that in 2012 about 0.5 % of the world population was visually impaired. Among this population, 10 % of the concerned people are completely blind. Independent navigation in outdoor environments is of extreme importance for visually impaired (VI) people. In order to perform daily activities, the VI attempt to memorize all the locations they have been through, so that they can recognize them afterwards.

In an unknown setting, VI people rely on white canes and guide dogs as primary assistive devices. Although the white cane is the simplest and cheapest mobility tool, it has a restricted sensing range and cannot infer additional information such as the speed and nature of the obstacle a VI user is encountering, or the distance and time to collision. On the other hand, guide dogs are highly expensive, require an extensive training phase and are effectively operational for only about 5 years [7]. Both the white cane and the guide dog provide short-range information and cannot detect overhanging obstructions.

The task of route planning in an environment with unforeseen obstacles can severely impede the independent travel of VI people and thus reduce their willingness to travel [28].

In this context, in order to improve cognition and assist the navigation of VI users, it is necessary to develop real-time systems able to provide guidance and to recognize both static and moving objects in highly dynamic and potentially cluttered urban scenes. The goal of such a technology is not to replace the white cane, but to complement it in an intelligent manner, by alerting the user of obstacles situated within a few meters or by providing direction/localization information. For acceptability reasons, a major constraint is imposed on the system: it should not interfere with the other senses, such as hearing or touch.

Motivated by the above-mentioned considerations, in this paper we introduce a novel VI-dedicated navigational assistant, developed using computer vision techniques, which can be used to improve or complement the human senses (i.e., hearing or touch). The approach is designed as a real-time, standalone application running on a regular smartphone. In this context, our system provides a low-cost, non-intrusive and simple device for VI navigation.

The rest of the paper is organized as follows. Section 1 presents a state of the art review (Section 1.1) and also introduces the general framework of our developments, which have been carried out within the context of the AAL (Ambient Assisted Living) European project ALICE (Section 1.2). The system includes a novel obstacle detection and classification module (Section 2), a landmark recognition module (Section 3) and an indoor detection/localization module for objects of interest (Section 4). Section 5 presents the experimental results obtained with real VI users traveling in challenging environments, with various arbitrarily moving objects and obstacles. Finally, Section 6 concludes the paper and opens some perspectives of future work.

1.1 Related work

In the last couple of years, various assistive technologies for blind and visually impaired people have been proposed.

Existing commercial approaches exploit the Global Positioning System (GPS) in order to provide guidance and to localize a VI person. However, in the context of people with special needs, such systems prove to be sensitive to signal loss and have a reduced accuracy in estimating the user position [56]. In urban areas with a high density of buildings, GPS sensors can exhibit localization errors of about 15 m [56]. Moreover, the GPS signal can be frequently lost in certain areas. Such limitations strongly affect the reliability of the proposed systems and severely penalize GPS-based approaches within the context of VI navigation applications.

Due mostly to the strong limitations related to the required computational power and to the lack of robustness of the vision algorithms, until recently, computer vision techniques have been relatively rarely used for developing VI-dedicated mobility assistants.

Nevertheless, in the past years, significant advances in computing and vision techniques have been achieved. It is now possible to run reliable algorithms in real time on embedded computers and even on smartphones equipped with powerful, multi-core processors. In addition, computer vision systems, unlike ultrasonic, infrared or laser technologies, offer a superior level of reproduction and interpretation of real scenes, at the price of a higher computational complexity. Let us now describe and analyze the existing state of the art systems, with their main features, advantages and limitations.

1.1.1 CCD camera systems

The tactile vision system (TVS), first introduced in [33], is designed as a compact, wearable device, able to detect obstacles in real time and provide directional information during indoor navigation. The alerting messages are sent to the VI user by using fourteen vibrating motors attached to a flexible belt. In this way, the hands-free and ears-free conditions are always satisfied. However, the system is not able to differentiate between ground and overhead obstacles.

The NAVI navigational assistant dedicated to obstacle detection, introduced in [61], is composed of a processing unit, a regular video camera, stereo headphones and a supporting vest. The camera captures a gray-scale, re-sampled video stream. Then, by using a fuzzy neural network, the system discriminates objects (foreground elements) from the background. The framework is operational in real time. However, no audio feedback is sent to the user and no information about the distance to the object is provided.

In [71], the authors introduce an obstacle detection system designed to determine a safe walking area for a VI user in an outdoor environment. By applying the Canny edge detection algorithm, vanishing points and sidewalk boundaries are extracted. Next, obstacles are identified by applying Gabor filters in order to determine quasi-vertical lines. The system is easy to wear, light and compact. The major drawbacks are related to its sensitivity to the user's movement and the violation of the ears-free constraint. In addition, the solution has never been tested in real-life conditions.

The SmartVision system [34] is designed to offer global indoor and outdoor navigation guidance (using GIS with GPS) together with the avoidance of static and dynamic obstacles. Diagonally distributed sidewalk boundaries are extracted to determine the path area. Then, the objects inside the path area are detected from quasi-vertical edges or changes in texture patterns. The system is sensitive to GPS signal loss, to the initial path positioning, and to situations where the VI user leaves the path or reaches intersections or crossings.

In [42], by using a regular camera mounted on the user's waist, the authors introduce an obstacle detection system able to differentiate between background and foreground objects through an inhomogeneous image re-sampling process: the background edges are sub-sampled, while the obstacle edges are oversampled in the top-view domain.

To our very best knowledge, the only system designed to incorporate a navigational assistant on a regular smartphone is proposed in [52]. By using computer vision techniques (color histograms and object edge detection), the prototype is able to detect with high confidence objects situated at arbitrary height levels. However, the evaluation was performed solely in indoor spaces and without VI users. In addition, the hands-free condition [46] imposed by VI users is violated because the smartphone needs to be hand-held.

A regular camera is more compact and easier to maintain than stereo cameras. However, it is much more difficult to estimate the distance or to distinguish between background and foreground objects. Despite the efforts made to detect obstacles from a single image without depth cues, the appearance and geometry models used in these systems are valid only in limited scenarios.

1.1.2 Stereo camera systems

Stereo cameras are more commonly used for building mobility aid systems, because depth can be computed directly from pairs of stereo images.

By using electro-tactile stimulation, GPS localization and visual sensors, the electronic neural vision system (ENVS) introduced in [48] is designed as a real-time application that facilitates the navigation of the VI and also alerts the user of potential hazards along the way. The warning messages are transmitted to the VI by electrical nerve stimulation gloves. In this case, the user's hands are always occupied. Moreover, ground and overhead objects are not detected, while the walking path needs to be flat, which limits the domain of applicability of the method.

The navigation assistant Tyflos, first introduced in [14] and extended in [15], is designed to detect surrounding obstacles. The system is composed of two video cameras for depth image estimation, a microphone, ear headphones, a processing unit and a 2D vibration vest. The architecture satisfies the hands-free constraint and the VI user can be alerted about obstacles situated at various height levels. However, the necessity of wearing a vibration vest close to the skin makes the entire framework invasive.

A wearable stereo device for indoor VI user navigation is proposed in [60]. The system is composed of a processing unit, a stereo camera and a chest mounted harness. The algorithm yields solely a metric map, which is difficult to exploit by blind people. Furthermore, the system is not able to perform in real time.

A stereo-based navigational assistant device for the VI is introduced in [54]. The system offers obstacle detection and feature localization capabilities using stereo reconstruction. The video cameras are head mounted and the warnings are sent through vibro-tactile stimulation. The micro vibration motors are placed on a vest worn close to the skin, which makes the entire framework invasive.

A stereo vision aerial obstacle detection system is introduced in [59]. The method builds a 3D map of the user's vicinity in outdoor environments, predicts the VI motion using 6DOF egomotion algorithms and evaluates possible aerial obstacles in the next pose. Nothing is said about the acoustic feedback sent to the user and no information about the distance to the object is provided.

The stereo vision system introduced in [12] detects obstacles by threshold segmentation of the scene saliency map. The 3D information of the obstacle is also computed. Finally, voice messages are transmitted to the VI user.

Although many stereo-vision-based systems have been introduced [12, 14, 15, 48, 54, 59, 60], some inherent problems still need to be solved. First, the stereo-matching algorithms fail to estimate large depths correctly, especially in poorly textured regions. Second, the quality and accuracy of the estimated depth map is sensitive to artifacts in the scene and to abrupt changes in the illumination.

1.1.3 RGB-D camera systems

More recently, the emergence of RGB-D cameras has enabled a new family of VI guidance systems exploiting such technologies. An RGB-D camera provides, in real time, a depth map of the whole scene as well as an RGB color map. Therefore, it can be conveniently used for both object detection and scene understanding purposes.

By using a Kinect depth sensor and acoustic feedback, the KinDetect system [35] aims at detecting obstacles and humans in real time. Obstacles situated at the level of the head or feet can be identified by processing the depth information on a backpack computer. However, since regular headphones are used to transmit the acoustic warnings, the user's ears are always occupied.

The system introduced in [64] can recognize a 3D object from depth data generated by the Kinect sensor. The VI users are informed not only about the existence of the detected obstacles, but also about their semantic category (chairs and upward stairs are here supported). In a similar manner, the framework proposed in [8] identifies nearby structures from the depth map and uses audio cues to convey obstacle information to the user.

In [56], 3D scene points recovered from the depth map (in indoor and outdoor scenarios) are classified as belonging either to the ground or to an object, by estimating the ground plane with the help of a RANSAC filtering technique. A polar accumulative grid is then built to represent the scene. The design is completed with an acoustic feedback to assist visually impaired users. The system was tested with real VI users and satisfies the hands-free and ears-free constraints.

The RGB-D camera depends on emitted infrared rays to generate a depth map. In outdoor environments, the infrared rays can be easily affected by sunlight. Therefore, guidance systems based on RGB-D cameras can only be used in indoor environments, which limits their range of use in a mobility aid system. Moreover, due to the limited computational resources, the computation of accurate and dense depth maps remains expensive.

The analysis of the state of the art shows that the existing systems have their own advantages and limitations, but do not meet all the features and functionalities needed by VI users. Existing systems focus on automatic obstacle detection, without proposing a joint detection/recognition scheme that could provide valuable feedback to the users with respect to the surrounding environment. In addition, motion information, which is essential for the comprehension of the scene dynamics, is not available. Concerning the localization capabilities, most of the techniques still rely on GPS approaches in the case of outdoor positioning, or include a priori knowledge (e.g., detailed maps) for more specific indoor scenarios.

In recent years, the emerging deep learning strategies showed promising results in various computer vision/image classification areas, including object detection and recognition and 2D/3D scene reconstruction.

1.1.4 Emerging deep-learning strategies

The deep learning approach can be interpreted as a method of hierarchical learning that uses a set of multiple layers of representation to transform data to high level concepts [23]. At each individual layer of transformation, higher level features are derived from the lower level features, leading to a hierarchical representation of information. Let us analyze how the emerging deep learning approaches have been considered for VI-dedicated applications.

In [69], a system is introduced for the automatic understanding and representation of image content. The system relies on a generative model that updates the system memory based on a non-linear function and on an image vocabulary determined using Convolutional Neural Networks (CNN). Even though the system returns good results on traditional image datasets such as PASCAL, Flickr30k or SBU, no experiments were conducted on video datasets. Furthermore, the authors provide no information regarding the computational complexity, which is of crucial importance when considering embedded VI applications.

In [70], the authors introduce a navigation assistant for unknown environments based on SLAM estimation and text recognition in natural scenes. The system improves the traditional SLAM camera pose and scene estimation by integrating real-time text extraction and representation for a quick rejection of candidate regions. The text extraction method is based on a deep learning network trained with 10^7 low-resolution input examples.

A mobility assistant for VI users, completely integrated on a mobile device, with face detection, gender classification and sound representation of images, is introduced in [10]. The system uses the video camera integrated in the smartphone to capture pictures of the environment in order to detect faces by using a CNN. Then, for gender estimation, the authors use logistic regression as the classification method, applied on feature vectors extracted from the CNN. However, no validation of the system with real VI users has been considered. In addition, no information regarding the transmission of acoustic signals to users is presented.

Concerning object detection techniques, various deep learning-based approaches have been introduced [21, 26, 63]. In [63], the authors propose replacing the last layer of a deep convolutional network with a regression layer in order to estimate the object's bounding box, while in [26] a bottom-up object detection system based on a deep model is introduced. Similarly, in [21] a saliency-inspired deep neural network is proposed.

The deep learning approaches offer highly interesting perspectives of development. The main issue that still needs to be solved relates to the computational complexity, which conditions the relevance of such approaches within the context of real-time, embedded VI-dedicated systems. The emergence of new mobile platforms equipped with powerful graphical boards (e.g., NVIDIA TX1) may offer a solution to this problem and enable in the future the deployment of such deep learning techniques for real-time object detection and classification purposes, within various contexts of application.

As a general conclusion of the above state of the art analysis, we can say that the difficulty is not in developing a system that has all the "bells and whistles", but in conceiving a technology that can last in time and be genuinely useful. For the moment, VI users cannot be completely confident about the robustness, reliability or overall performance of the existing prototypes. Any new technology should be designed not to replace the cane or the guide dog, but to complement them, by alerting the user of obstacles situated within a few meters and by providing guidance.

Our work has notably been carried out within the framework of the European project ALICE (www.alice-project.eu), supported by the AAL (Ambient Assisted Living) program. The ALICE project had as ambitious objective the development of a VI-dedicated navigational assistant. The main features of the ALICE navigational assistant are briefly described in the following section.

1.2 The ALICE framework

ALICE aims at offering to visually impaired users a cognitive description of the scenes they are evolving in, based on a fusion of perceptions gathered from a range of sensors, including image/video, GPS, audio and mobile-related. The ALICE system, illustrated in Fig. 1, is composed of a regular smartphone attached to a chest mounted harness and bone conduction headphones.
Fig. 1 The hardware components of the ALICE device

The harness has two major roles: it makes it possible to satisfy the hands-free requirement imposed by the VI and improves the video acquisition process, by reducing the instabilities related to cyclic pan and tilt oscillation. The system can be described as a wearable and friendly device, ready to use by the VI without any training.

The proposed solution is low-cost, since it does not require any expensive, dedicated hardware architecture, but solely general public components available at affordable prices on the market.

In addition, the system is also non-intrusive, satisfying the hands-free and ears-free requirements imposed by VI users.

The main functionalities offered by ALICE are the following:
  • Real-time detection of obstacles and moving objects (cars, pedestrians, bicycles),

  • Automatic identification of crossings, traffic lights…

  • Landmark recognition and specification of annotated itineraries,

  • Precise localization through enhanced GPS navigation techniques,

  • Adapted human-machine interfaces: non-invasive feedback with minimum verbalization, and enactive, earconic/haptic signals.

Within this framework, our developments notably concern the computer vision-related capabilities integrated in the ALICE device (Fig. 2). The following section summarizes the proposed contributions.
Fig. 2 Computer vision capabilities of the ALICE system

1.3 Contributions

The main contributions presented in this paper concern:
  • an obstacle detection method (Section 2.1), based on apparent motion analysis. The originality of the approach comes from the whole chain of apparent motion analysis proposed. Semi-dense interest point extraction, motion-based agglomerative clustering, and motion description are the key ingredients involved at this stage. The method makes it possible to reliably detect both static and dynamic obstacles, without any a priori knowledge of the type of object considered. Moreover, the motion analysis makes it possible to acquire useful information that is exploited for prioritizing the alerts sent to the user.

  • an object recognition/classification approach (Section 2.2), which introduces the concepts of relevant interest point extraction and adaptive HOG descriptors, and shows how they can be exploited in a dedicated BoVW / VLAD image representation. The strong point of the method relates to its ability to deal with multiple categories, without the need of using different templates and/or sliding windows. The object detection and recognition method is able to run in real time.

  • a landmark recognition module (Section 3), dedicated to the improvement of the GPS localization accuracy through computer vision methods. The main originality of the approach concerns the two-step matching procedure, which makes it possible to benefit from both the FLANN matcher's speed and the BruteForce matcher's consistency. Remarkably, the system can work entirely on a smartphone in off-line mode, with response times to the queries of about 2 s (for 10 to 15 landmarks).

  • an indoor object localization module (Section 4). The user has the possibility to pre-learn a set of objects of interest. At the run-time stage, a detection and tracking technique makes it possible to detect and identify such objects in cluttered indoor scenes. The main contribution concerns the spatial layout verification scheme, which makes it possible to achieve robustness without increasing the computational burden.

All the methods involved were specifically designed and tuned under the constraint of achieving real-time processing on regular smartphones. To our very best knowledge, no other state of the art approach can offer such a complete set of computer vision functionalities dedicated to VI assistance and adapted to real-time processing on light devices.

The validation of the proposed methodology is presented in Section 5. We have objectively evaluated each of the involved methods on ground truth data sets and with recognized performance measures (Sections 5.1, 5.2 and 5.3).

Let us first detail the obstacle detection/recognition approach, which is the core of the proposed methodology.

2 Obstacle detection and classification

When navigating in urban environments, the user can encounter a variety of obstacles, which can be either static (e.g., objects in the street that can cause injuries and should be avoided) or dynamic (e.g., other pedestrians, vehicles, bicycles…). In order to ensure a safe navigation, it is important to be able to detect such obstacles in real-time and alert the user. Our approach makes it possible to both detect such elements and to semantically interpret them.

Let us first describe the obstacle detection method proposed.

2.1 Static and dynamic object detection

We start by extracting interest points regularly sampled over the video frame. Let us mention that we have also considered more powerful, content-related interest point extractors such as SIFT [44] and SURF [5]. However, we have empirically observed that, generally, in outdoor environments the background yields a significantly higher number of interest points than the obstacles/objects.

Gauglitz et al. present in [25] a complete evaluation of interest point detectors (i.e., corner, blob and affine invariant detectors) and feature descriptors (i.e., SIFT, PCA-SIFT, SURF) in the context of visual tracking. The evaluation is performed on a video dataset of planar textures, which includes inconsistent movement, different levels of motion blur, geometric changes (panning, rotation, perspective distortion, zoom), and lighting variation (static and dynamic). In the context of obstacle detection the following conclusions can be highlighted:
  • The execution time when tracking interest points between consecutive frames and random frame pairs is significant (about 100 ms for SIFT).

  • For an increased temporal distance between images, the repeatability of all detectors decreases significantly, which is problematic for object tracking purposes.

  • Large camera motion leads to strong variations in the background. Consequently, the neighboring interest points between adjacent images can be significantly different.

  • None of the detectors cope well with the increased noise level of the darker static lighting conditions.

  • Furthermore, in the case of low resolution videos or for less textured regions SIFT or SURF detectors extract a reduced number of interest points.

Based on this analysis, we have privileged a semi-dense sampling approach, which fits the computational complexity requirements without degrading the detection performances. A uniform grid is constructed. The grid step is defined as: Γ = (W∙H)/ Npoints, where W and H are the dimensions of the image and Npoints is the maximum number of points to be considered.

The value of parameter Npoints determines a trade-off between detection accuracy and computational speed. In our case, a good compromise has been achieved for a value of Npoints set to 1000 interest points, for videos acquired at a resolution of (320 × 240 pixels).
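As an illustration, a minimal sketch of this semi-dense sampling step is given below (in Python). The interpretation of the grid step as the square root of the per-point area Γ, as well as the function and parameter names, are our own assumptions.

```python
import numpy as np

def grid_interest_points(width, height, n_points=1000):
    """Semi-dense, uniformly spaced interest points (illustrative sketch).

    Gamma = (W * H) / n_points gives the image area allotted to one point;
    the spacing between points is approximated here by sqrt(Gamma).
    """
    gamma = (width * height) / float(n_points)   # area per point
    step = max(1, int(round(np.sqrt(gamma))))    # assumed spacing, in pixels
    xs = np.arange(step // 2, width, step)
    ys = np.arange(step // 2, height, step)
    return np.array([(x, y) for y in ys for x in xs], dtype=np.float32)

# Example for a 320 x 240 frame with at most 1000 points
points = grid_interest_points(320, 240, 1000)
```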

In order to identify static or dynamic obstacles, we consider a motion-based analysis approach. Thus, the objective is to determine all objects that exhibit an apparent motion different from the background. This makes it possible to identify moving objects, but also static objects (e.g., obstacles) that appear in the foreground while the user is moving within the scene.

First, we need to determine the interest point displacements (i.e., 2D apparent motion vectors) between successive frames. To this purpose, we retained the multiscale Lucas-Kanade algorithm (LKA) [45]. The main limitations of the LKA arise when its underlying assumptions of brightness constancy and spatial coherence are violated.

Let us note that more recent methods such as [6] are able to increase the estimation accuracy and are robust to abrupt changes in the illumination. However, in our case, where the computational burden is an important constraint, we cannot adopt this strategy. Thus, we prefer to exploit a "relatively good" estimation of the motion vectors, rather than a highly accurate one that would compromise the real-time capability. As we will see in the following, when combined with a motion clustering approach, this is sufficient to ensure high detection performances.

The LKA tracking process is initialized with the set of interest points of the uniform grid considered. Then, these points are tracked between successive images. However, in practice the LKA cannot determine a trajectory for all interest points (e.g., when the video camera is moving, obstacles disappear or other/new objects appear). So, when the density of points in a certain area of the image falls below the grid resolution, we locally reinitialize the tracker with points from the grid. The new points are then assigned to existing objects.
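A possible implementation of this tracking step with OpenCV's pyramidal Lucas-Kanade routine is sketched below; the window size, pyramid depth and termination criteria are illustrative values, not the exact settings used by the authors.

```python
import cv2
import numpy as np

lk_params = dict(winSize=(21, 21), maxLevel=3,
                 criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))

def track_points(prev_gray, curr_gray, prev_pts):
    """Track grid points between two gray-scale frames with pyramidal LK.

    Returns the matched point pairs; points whose status flag is 0 are
    dropped and should later be re-seeded from the uniform grid.
    """
    prev_pts = prev_pts.reshape(-1, 1, 2).astype(np.float32)
    curr_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, curr_gray, prev_pts, None, **lk_params)
    ok = status.ravel() == 1
    return prev_pts[ok].reshape(-1, 2), curr_pts[ok].reshape(-1, 2)
```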

Let us denote by p1i = (x1i, y1i) the i-th keypoint in the reference image and by p2i = (x2i, y2i) its correspondent, determined with the LKA in the successive frame. The associated motion vector vi = (vix, viy) is also expressed in polar coordinates, with angular value θi and magnitude Di.

The availability of the motion vectors makes it possible to determine first the global motion of the scene, modeled as a homographic transform between successive frames.

2.1.1 Global motion estimation

We robustly determine the global homographic transform H between adjacent frames with the help of the RANSAC (Random Sample Consensus) algorithm [38]. If we consider a reference point expressed in homogeneous coordinates, p1i = [x1i, y1i, 1]^T, and a homographic matrix H, we can estimate the new position of the point, p2i^est = [x2i^est, y2i^est, 1]^T, in the successive frame.

For each interest point, we compute the L2 distance between the estimated position p2i^est and the tracked position p2i of that interest point (determined using the LKA):
$$ E\left(p_{1i}, H\right) = \left\Vert p_{2i}^{est} - p_{2i} \right\Vert $$
(1)

In order to determine the background interest points, we compare E(p1i, H) to a predefined threshold ThBG. The interest points satisfying this condition are marked as inliers (i.e., belonging to the background), while the outliers represent keypoints associated to the different moving objects existing in the scene (i.e., foreground objects). In our experiments we fixed the background/foreground separation threshold ThBG to 2 pixels.
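The following sketch illustrates how the RANSAC-based homography estimation and the inlier/outlier separation of Eq. (1) could be implemented with OpenCV; the function name is ours, while the threshold ThBG = 2 pixels is the one stated above.

```python
import cv2
import numpy as np

TH_BG = 2.0  # background/foreground separation threshold, in pixels

def split_background_foreground(p1, p2, th_bg=TH_BG):
    """Estimate the global homography with RANSAC and separate the tracked
    points into background inliers and foreground outliers.

    p1, p2: (N, 2) arrays of matched point positions in successive frames.
    """
    H, _mask = cv2.findHomography(p1, p2, cv2.RANSAC, th_bg)
    p1_h = p1.reshape(-1, 1, 2).astype(np.float32)
    p2_est = cv2.perspectiveTransform(p1_h, H).reshape(-1, 2)
    err = np.linalg.norm(p2_est - p2, axis=1)   # E(p1i, H) of Eq. (1)
    background = err < th_bg
    return H, background, ~background
```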

In outdoor scenes, multiple moving objects can be encountered. For this reason, we focused next on the detection of foreground objects.

2.1.2 Foreground object identification

Let us note that, due to the apparent foreground motion, even static obstacles situated in the foreground can act like moving objects relative to the background. So, we further cluster the set of outlier points into different classes of motion. To this purpose, we exploit an agglomerative clustering technique described in the following.

The principle consists of considering first each interest point as an individual cluster. Then, adjacent clusters are successively merged together based on a motion similarity criterion. The operation stops when no remaining interest point satisfies the similarity constraint. The sensitivity of the method is notably determined by the considered similarity measure between the interest point motion vectors assigned to different clusters. In this paper, we propose the following strategy:
  • Phase I – Construct the histogram of the motion vectors' angular coordinates. To this purpose, the angular coordinates are represented as integer degrees (from 0° to 360°). An arbitrarily chosen interest point in the set of points with the most represented angular value determines a new motion cluster MCl. Let θ(MCl) denote its angular value.

  • Phase II – For all the keypoints not yet assigned to any cluster, compute the angular deviation by taking as reference the centroid value:
$$ \delta\left(\theta_i, \theta\left(MC_l\right)\right) = \left| \theta_i - \theta\left(MC_l\right) \right| $$
    (2)

If the angular deviation δ(θi, θ(MCl)) is inferior to a predefined threshold ThAD and if the corresponding motion magnitudes are equal, then the i-th point is assigned to the MCl cluster. Let us note that the motion magnitude values are here quantized to their nearest integer value, for comparison purposes. For the remaining outlier interest points, the process is repeated recursively until all the points are assigned to a motion class. In our experiments, we set the grouping threshold ThAD to 15°.
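A simplified sketch of this angular grouping is given below. The seeding of new clusters and the magnitude quantization follow the description above, while the handling of the circular angle wrap-around is our own addition.

```python
import numpy as np

TH_AD = 15.0  # angular grouping threshold, in degrees

def cluster_by_motion(angles_deg, magnitudes, th_ad=TH_AD):
    """Greedy agglomerative grouping of foreground points by motion.

    angles_deg: apparent motion directions, quantized to integer degrees.
    magnitudes: motion magnitudes, quantized to their nearest integer.
    Returns one cluster label per point (-1 while still unassigned).
    """
    angles = np.round(angles_deg).astype(int) % 360
    mags = np.round(magnitudes).astype(int)
    labels = np.full(len(angles), -1)
    current = 0
    while np.any(labels == -1):
        free = np.where(labels == -1)[0]
        # seed: a point with the most represented angular value among free points
        vals, counts = np.unique(angles[free], return_counts=True)
        seed_angle = vals[np.argmax(counts)]
        seed = free[angles[free] == seed_angle][0]
        dev = np.abs(angles[free] - angles[seed])
        dev = np.minimum(dev, 360 - dev)           # circular angular deviation
        same = (dev < th_ad) & (mags[free] == mags[seed])
        labels[free[same]] = current
        labels[seed] = current
        current += 1
    return labels
```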

To each motion cluster, a centroid point is assigned, defined as the point in the considered cluster whose motion vector angular coordinate has the median value over the set of all points assigned to that cluster.

As a final stage, the k-NN clustering algorithm [73] is applied in order to verify the spatial consistency of the determined motion classes. Thus, we determine for each point its k nearest neighbors using the Euclidean distance (between their corresponding spatial positions). If at least half of the retrieved neighbors do not belong to the same motion class, we consider that the point's assignment to the present cluster is due to an error in the grouping process. Consequently, the point is removed from the motion class and assigned to the background.

In cluttered outdoor environments, objects can disappear, stop or be occluded for a period of time. In such situations, incomplete trajectories can be obtained or, even worse, the same object can be identified as a new entity in the scene. In order to deal with such situations, the object detection process is reinforced with a multi-frame, long-term fusion scheme. By saving the object location and its average velocity within a temporal sliding window of size Twindow, we can predict its new, global-motion-compensated position as described in Eq. (3):
$$ p_i\left(t_j\right) = p_i\left(t_{j-1}\right) + \frac{1}{T}\sum_{k=1}^{T} v_i\left(t_{j-k}\right) - p_i^{est}\left(t_j\right) $$
(3)
where pi(tj) is the i-th interest point at frame tj, vi is the motion vector velocity and pi^est(tj) is the estimated point location induced by the camera motion, obtained by applying Eq. (1) to pi(tj-1) with the current homographic matrix H.

So, when a previously detected object is occluded, a discontinuity in its trajectory will be observed. By estimating the object's position in the frames where it is hidden, we can determine, at the moment when it re-enters the scene, that it corresponds to an obstacle already detected.

Once the obstacles are identified, we determine their degree of danger and classify them accordingly, as described in the following section. Let us underline that no a priori knowledge about the size, shape or position of an obstacle is required.

2.1.3 Motion-based object description

This stage is performed by using the object position and motion direction, relative to the VI user. For each motion cluster, the analysis is performed on the corresponding centroid point previously determined. A global motion compensation procedure is first applied, in order to characterize the centroid movement independently of the camera movement.

Let P1 denote the centroid position at frame N and P2 its motion-compensated position at frame N + 1. We considered a reference point P0 as the camera focus of attention defined by convention in the middle of the bottom row of each frame (Fig. 3).
Fig. 3 Obstacle direction estimation

Then, we compute the object's angular displacement α = \( \widehat{P_0 P_1 P_2} \) (Fig. 3). The object is labeled as approaching (AP) if the angle α is inferior to a specified threshold (ThAP/DE); otherwise, the object is considered as departing (DE), i.e., the subject is moving away from the obstacle. The ThAP/DE parameter helps to perform a first, preliminary classification based on the degree of dangerousness of the various objects existing in the scene. A higher value of the ThAP/DE threshold means that a larger set of obstacles will be considered as approaching the user, and vice-versa.

Because the human body might shake or slightly rotate in time, we included a reinforcement strategy based on motion consistency over time. So, by saving the object directions α within the temporal sliding window of size Twindow (cf. Section 2.1.2), we can predict and verify the new directions relative to the general object movement. The AP/DE decision is then taken as the majority label within the considered sliding window. In our experiments, we have tested values of the ThAP/DE parameter in the interval [32, 40] and obtained equivalent performances (±5 % of objects detected as approaching/departing). We finally set ThAP/DE to 45°, which leads to reasonable performances in a majority of situations.
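The per-frame AP/DE decision can be sketched as follows; the temporal majority vote over the sliding window is omitted here, and the angle is computed at vertex P1, as defined above.

```python
import numpy as np

TH_AP_DE = 45.0  # angular threshold, in degrees

def approaching_or_departing(p0, p1, p2, th=TH_AP_DE):
    """Label a motion cluster as approaching (AP) or departing (DE).

    p0: camera focus of attention (middle of the bottom image row),
    p1: cluster centroid position at frame N,
    p2: motion-compensated centroid position at frame N+1.
    The decision uses the angle alpha at vertex P1 in the triangle P0-P1-P2.
    """
    a = np.asarray(p0, float) - np.asarray(p1, float)
    b = np.asarray(p2, float) - np.asarray(p1, float)
    cos_alpha = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)
    alpha = np.degrees(np.arccos(np.clip(cos_alpha, -1.0, 1.0)))
    return "AP" if alpha < th else "DE"
```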

We propose to use a trapezium region projected onto the image in order to define the user’s proximity area.

We used for video acquisition the camera embedded on a regular smartphone with an angle of view α = 69°. The smartphone is attached to the user at an average elevation (E) of 1.3 m (meters).

For the trapezium, the height is set to a third of the total image height (i.e., the ST segment in Fig. 4). We can establish the distance between the user and the bottom row of the trapezium as the RT segment (Fig. 4):
$$ RT = E/\tan\left(\alpha/2\right) = 1.85\ m $$
(4)
Fig. 4 Real distance estimation

Nevertheless, the size of the trapezium can be adjusted by the user in a pre-calibration step. A warning message will be generated only for obstacles situated at a maximum distance of about five meters from the user:
$$ RN = RT + TN = 1.85 + 2\cdot E/\tan\left(\alpha/2\right) \cong 5.5\ m $$
(5)

An obstacle is marked as urgent (U) if it is situated in the proximity of the blind/visually impaired person. Otherwise, if located outside the trapezium, the obstacle is categorized as non-urgent or normal (N). By employing the proximity area, we can prevent the system from continuously warning the subject about every object existing in the scene: a warning is launched only for objects situated in the urgent region.

The downside of this assumption is the rejection of warnings for dynamic objects (e.g., vehicles) approaching the user very fast, or for obstacles situated high, at head level, such as tree branches, arcades or banners. To avoid such situations, it is necessary to distinguish and recognize the various types of objects. Using this information, we can then generate warnings for objects situated outside the proximity trapezium whenever such an action is required. To this purpose, we propose an obstacle recognition/classification method, further described in Section 2.2.

Let us underline that the above-described approach depends only weakly on the technical characteristics of the considered smartphone and can be easily implemented on various mobile devices. In our case, we have successfully tested the approach on multiple devices, ranging from the LG G3 and HTC One to the Samsung Galaxy S2, S3 and S4. Actually, ALICE can be optimally integrated on any mobile device running the Android operating system, with a processor faster than 1.3 GHz and 2 GB of RAM. Regarding the angle of view of a video camera embedded on a regular smartphone, this is superior to 60° in most cases.

Let us also note that the VI user's height is a parameter that has little effect on the overall system performance. Our only constraint is to attach the smartphone at an average elevation of about 1.3 m, which can be adjusted with the help of the chest mounted harness. If this constraint cannot be satisfied, then the static obstacles will be detected at a maximum distance from the user smaller than five meters. For dynamic objects, the smartphone elevation has no effect on the detection efficiency.

2.2 Obstacle classification

Each frame of the video stream can be considered as a hierarchical structure with increasingly higher levels of abstraction. The objective is to capture the semantic meaning of the objects in the scene. In this framework, we have considered the following four major categories: vehicles, bicycles, pedestrians and static obstacles. A training dataset has been constituted for learning purposes, extracted from the PASCAL repository [22]. The training set includes 4500 images as follows: 1700 vehicles, 500 bicycles, 1100 pedestrians and 1200 static obstacles. Some samples are illustrated in Fig. 5.
Fig. 5 Samples from the considered training set

The considered categories were selected according to the most frequently encountered obstacles in outdoor navigation. When creating a vocabulary of visual words, an important concern is the size and the choice of the data used to construct it. As indicated in the state of the art [29], the most accurate results are obtained when the vocabulary is built from the same data source that is going to appear in the classification task. However, in the ALICE framework, the categorization phase can be considered as a more focused task of recognizing a relatively reduced set of specific objects, rather than a generic classification task. Because the number of categories is known in advance, we have sampled the descriptors from the training data so as to have a good coverage of all 4 considered classes. To this purpose, we used 3300 image patches selected from the PASCAL corpus, enriched with 1200 image patches representing fences, pylons, trees, garbage cans, traffic signs, overhanging branches, edges of pavement, ramps, bumps, steps… selected from our own dataset. We can naturally expect that an increase of the training dataset (e.g., with the ImageNet database) can: (1) enhance the performances of the classification stage, and (2) be useful for a finer categorization into an extended number of categories (notably by refining the obstacle class into sub-categories). However, for the time being we focused solely on the 4 categories retained and thus considered a reduced training set. In order to deal with situations where the training images do not perfectly match the images captured in real life, we introduce an extra Outlier category, which gathers the image patches that cannot be reliably assigned to one of the 4 categories.

The proposed obstacle classification framework is illustrated in Fig. 6.
Fig. 6 Obstacle recognition/classification

  1. Firstly, for each image in the dataset, low-level image descriptors are extracted (i.e., relevant interest points or A-HOG).

  2. Then, using these descriptors, an unsupervised learning step is performed in order to create a vocabulary. Each descriptor is mapped to its nearest word in order to build a global image representation. Two different approaches are retained here: the first one concerns the Bag of Visual Words (BoVW) representation built upon the low-level descriptors, while the second adopts the Vector of Locally Aggregated Descriptors (VLAD) methodology.

  3. In the final stage, the image patch classification is performed. Here we adopted a strategy based on Support Vector Machines (SVM) with Radial Basis Function (RBF) kernels.

2.2.1 Feature extraction and description

Relevant interest points

All images in the dataset are mapped onto a common format by resizing them to a maximum of 16 k pixels, while preserving the original aspect ratio. For each image, we extract interest points using pyramidal FAST [67] algorithm. Then, we have privileged a simple, semi-dense sampling approach, which fits the computational complexity requirements without degrading the retrieval performances.

After the interest points are extracted, we overlap a regular grid onto the image. Then, we propose to characterize each rectangle of the grid using only one keypoint. Instead of applying a traditional interest point selection strategy (most often, the center of each grid cell), we determine which points fall within the rectangle, filter and rank them based on the Harris algorithm [31], and then select the most representative one (the one with the highest value of the Laplacian operator). The process is controlled by the grid step parameter, defined as Γ = (W·H) / Npoints, where W and H are the dimensions of the image and Npoints is the maximum number of points.
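A sketch of this per-cell selection is given below; for simplicity, the keypoint response score computed by OpenCV is used as a stand-in for the Harris/Laplacian ranking described above, and the function name is ours.

```python
import cv2
import numpy as np

def relevant_interest_points(gray, n_points=1000):
    """Keep at most one representative FAST keypoint per grid cell.

    Within each cell of the regular grid, the detected keypoints are ranked
    by their response score and only the strongest one is retained.
    """
    fast = cv2.FastFeatureDetector_create()
    keypoints = fast.detect(gray, None)
    h, w = gray.shape[:2]
    step = max(1, int(round(np.sqrt((w * h) / float(n_points)))))
    best = {}
    for kp in keypoints:
        cell = (int(kp.pt[0]) // step, int(kp.pt[1]) // step)
        if cell not in best or kp.response > best[cell].response:
            best[cell] = kp
    return list(best.values())
```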

Figure 7 illustrates the results obtained on a given image patch with the proposed interest point extraction method (Fig. 7c), the traditional FAST algorithm (Fig. 7a), and the FAST points filtered with the Harris-Laplacian procedure (Fig. 7b).
Fig. 7 Interest point extraction using: a traditional FAST, b FAST filtered using Harris-Laplacian, c the proposed technique

Let us underline that the same number of interest points is retained for the examples in Fig. 7b and c. As it can be observed, the set of interest points obtained after applying the proposed method is more uniformly distributed within the image patch and thus potentially better suited to capture the underlying informational content. The retained interest points are further described using the SIFT descriptor [44].

Let us note that the technological advances of mobile devices have led in recent years to a variety of dedicated, light-weight, binary interest point descriptors adapted to the computational capacities of such devices. Among the most representative approaches, let us mention the BRISK (Binary Robust Invariant Scalable Keypoints) [40], BRIEF (Binary Robust Independent Elementary Features) [39], ORB (Oriented FAST and Rotated BRIEF) [58], and FREAK (Fast REtinA Keypoint) [1] descriptors. Apart from their fast extraction procedures, such descriptors offer very quick matching times with the help of distance measures adapted to binary values, such as the Hamming distance. This actually represents the biggest advantage of binary descriptors, as it replaces the more costly Euclidean distance. The major drawback of such approaches concerns the matching performances, which are significantly lower than those of the SIFT descriptor, in particular when the scale variations of objects become significant (less than 0.5 or greater than 2) [1].

For this reason, in our work we have retained the standard SIFT descriptors which ensure high matching performances, reproducibility and scale invariance at a computational cost that can still fulfill the real-time constraint required by our application.

Adaptive HOG descriptor extraction

In the traditional HOG [16] approach, an image I(x,y) is divided into a set of non-overlapping cells. For each cell, a 1D histogram of gradient directions is computed. Next, a normalization step is performed on every cell block in order to form a histogram that is invariant to shadows or changes in illumination.

The traditional HOG descriptor was initially developed for human detection. In [16], the authors propose using an analysis window of 64 × 128 pixels for an accurate localization and recognition of pedestrians.

Some improvements are proposed by Rosa et al. [57], where the HOG descriptor is used for people detection by employing detection windows of fixed resolution (64 × 128 pixels or 64 × 64 pixels). In our case, such an approach is not suitable, because our system is designed to detect objects with a high variability of instances and shapes.

Directly applying the traditional HOG method to our case would require constraining the size of the image patch (extracted using the obstacle detection method described in Section 2.1 and representing the object's bounding box) to a fixed resolution. However, a fixed resolution of the analysis window would significantly alter the aspect ratio of the patch and thus the corresponding HOG descriptors, with an impact on their discriminative power (Fig. 8b). Consequently, the system would return high recall rates only for the pedestrian class.
Fig. 8 Low-level descriptor extraction: a image patch at the original resolution; b traditional HOG; c adaptive HOG (A-HOG)

Different research works [17] propose solutions for overcoming this limitation. The principle consists of modifying the size of the patch to a pre-established value, appropriate for each category (e.g., 120 × 80 pixels for bicycles, 104 × 56 pixels for cars). Even so, the high variability of instances considered in our case makes it impossible to select a specific resolution adequate for each element (e.g., garbage cans, traffic signs). Moreover, because of the real-time constraint of our application, it is intractable to use a multiple window size decision approach.

In order to avoid such limitations, we introduce a novel version of the HOG descriptor, denoted adaptive HOG (A-HOG). The A-HOG approach dynamically modifies the patch resolution but conserves its original aspect ratio (Fig. 8c). We also limit the maximum number of cells for which we extract the descriptor. So, when the algorithm receives as input a new image patch, it starts by computing its associated aspect ratio. Then, it modifies the patch width and height as:
$$ w=\left[\sqrt{ar\cdot ncell\cdot csize}\right];\;h=\left[w/ar\right] $$
(6)
where ar is the patch aspect ratio, ncell is the maximum number of cells for which we extract the HOG descriptor and csize is the dimension of a cell.

The size of the patch is adapted in such a way as to meet both requirements: conserving the initial aspect ratio and matching the fixed number of cells imposed.
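The following sketch illustrates the A-HOG resizing of Eq. (6), followed by a standard HOG extraction (here with scikit-image); the interpretation of csize as the pixel area allotted to one cell is our assumption on the paper's notation.

```python
import cv2
import numpy as np
from skimage.feature import hog  # standard HOG used here for illustration

def adaptive_hog(patch_gray, ncell=64, csize=64):
    """Adaptive HOG (A-HOG) sketch following Eq. (6).

    The patch is resized so that its aspect ratio is preserved while the
    number of cells stays bounded by ncell; csize is taken here as the area
    (in pixels) allotted to one cell.
    """
    h0, w0 = patch_gray.shape[:2]
    ar = w0 / float(h0)
    w = int(round(np.sqrt(ar * ncell * csize)))   # Eq. (6)
    h = int(round(w / ar))
    resized = cv2.resize(patch_gray, (w, h))
    cell_side = int(round(np.sqrt(csize)))
    return hog(resized, orientations=9,
               pixels_per_cell=(cell_side, cell_side),
               cells_per_block=(2, 2), feature_vector=True)
```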

Whatever the low-level descriptors considered, either the relevant interest points (described with SIFT) or A-HOG, an aggregation procedure is required in order to obtain a global image representation.

2.2.2 Global image representation

In order to represent an image as a high dimensional descriptor we evaluated two methods as described below: BoVW and VLAD.

BoVW image representation

In the classical BoVW (Bag of Visual Words) [13] framework, each image in the dataset can be described as a set of SIFT (associated to the detected interest points) or A-HOG descriptors Di = {di1, di2, …, din}, where dij represents the j-th descriptor of the i-th image (i.e., the descriptor associated to the j-th interest point/cell of image Ii) and n is the total number of interest points/cells in an image.

The extracted descriptors are clustered with the help of the k-means algorithm [19], which makes it possible to obtain a codebook W of visual words.

An arbitrary local descriptor dij can then be mapped onto its nearest prototype (visual word) w(dij) in the vocabulary W:
$$ w\left(d_{ij}\right) = \arg\min_{w \in W} \left\Vert w - d_{ij} \right\Vert_1 $$
(7)
where ‖·‖1 denotes the L1 norm in the descriptor space.
This procedure makes it possible to represent each image in the dataset as a histogram of visual words. The total number of bins that compose the histogram is equal to the number of words K included in the vocabulary. Each bin bk represents the number of occurrences of the visual word wk in W: bk = Card(Dk):
$$ D_k = \left\{ d_{ij},\; j \in \left\{1, \dots, n\right\} \;\middle|\; w\left(d_{ij}\right) = w_k \right\} $$
(8)
where Dk is the set of descriptors associated to a specific visual word wk in the considered image and Card(Dk) is the cardinality of the set Dk.
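A compact sketch of the BoVW encoding is given below; scikit-learn's k-means is used for the codebook, and note that its assignment step relies on the Euclidean distance rather than the L1 norm of Eq. (7).

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, k=500):
    """Offline step: learn the visual vocabulary W with k-means."""
    return KMeans(n_clusters=k, n_init=4, random_state=0).fit(all_descriptors)

def bovw_histogram(descriptors, codebook):
    """Map each local descriptor of one image to its nearest visual word and
    count the occurrences, producing the K-bin BoVW histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-9)   # optional L1 normalization
```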

VLAD image representation

The vector of locally aggregated descriptors (VLAD) [32] is designed to characterize an image Ii in the dataset by the differences between its local features Di = {di1, di2, …, din} and the codebook words ck learned off-line. The residuals of all the descriptors assigned to the same visual word in the codebook are accumulated. So, for an image, the original VLAD representation encodes the interest point/A-HOG descriptors as follows:
$$ v_k = \sum_{d_{ij} \in D_i;\; d_{ij} \cong c_k} \left( d_{ij} - c_k \right) $$
(9)

Each centroid ck in the codebook determines a vector of aggregated residuals. The final VLAD signature v is determined by the concatenation of all residual vectors vk. The VLAD dimension is given by the product between the codebook size and the descriptor dimension.

As proposed by Delhumeau et al. [18], in order to reduce the influence of bursty features (caused by repetitive structures in the image) that might dominate the other descriptors, we apply a power-law normalization on the VLAD descriptor as a down-weighting factor:
$$ \tilde{v}_l = \operatorname{sign}\left(v_l\right) \cdot \left|v_l\right|^{\alpha} $$
(10)
where α is the normalization parameter. It was suggested in [18] that α = 0.2 is a good choice. To further improve the performance of power law normalization, the descriptor is transformed to a different coordinate system using PCA, without any dimensionality reduction.
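A sketch of the VLAD aggregation with power-law normalization is given below; the PCA rotation mentioned above is omitted, and the final L2 normalization is a common complement that the text does not state explicitly.

```python
import numpy as np

def vlad(descriptors, centroids, alpha=0.2):
    """VLAD aggregation (Eq. 9) with power-law normalization (Eq. 10).

    descriptors: (n, d) local features of one image.
    centroids:   (K, d) codebook learned offline.
    """
    K, d = centroids.shape
    # assign each descriptor to its nearest centroid
    dists = np.linalg.norm(descriptors[:, None, :] - centroids[None, :, :], axis=2)
    nearest = np.argmin(dists, axis=1)
    v = np.zeros((K, d))
    for k in range(K):
        assigned = descriptors[nearest == k]
        if len(assigned):
            v[k] = (assigned - centroids[k]).sum(axis=0)   # residual accumulation
    v = v.ravel()
    v = np.sign(v) * np.abs(v) ** alpha                    # power-law normalization
    return v / (np.linalg.norm(v) + 1e-9)                  # final L2 normalization
```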

2.2.3 Image classification

The final step of the obstacle categorization framework can be divided in two stages: an offline process (SVM training) and an online process (SVM prediction).

The offline process consists of a supervised learning strategy based on SVM (Support Vector Machine) training. The BoVW / VLAD image representation is fed into an SVM that uses a statistical decision procedure in order to differentiate between categories.

We adopted the strategy first introduced in [66], designed to find a separating hyperplane between two classes by maximizing the margin:
$$ \phi(x) = \operatorname{sign}\left( \sum_i y_i \alpha_i K\left(x, x_i\right) + b \right) $$
(11)
where K is the SVM kernel, xi are the training features from the data set, yi the label of xi, b is the hyperplane free term, while αi is a parameter dependent on the kernel type. In our case, we have adopted the RBF kernel and used the implementation available in [36].

The SVM training completes the offline process of our object classification framework.

In the online phase, for each image patch extracted using our obstacle detection method (Section 2.1), we construct the BoVW histogram / VLAD descriptor using the A-HOG or the relevant interest points with their SIFT representation. The SVM classification is then performed to establish the object's class.
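For illustration, the offline training and the online prediction could be sketched as follows; scikit-learn is used here as a stand-in for the SVM implementation of [36], and the hyper-parameter values are ours.

```python
from sklearn.svm import SVC

def train_classifier(train_signatures, train_labels):
    """Offline step: fit an RBF-kernel SVM on BoVW/VLAD image signatures."""
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")   # illustrative hyper-parameters
    clf.fit(train_signatures, train_labels)
    return clf

def classify_patch(clf, signature):
    """Online step: predict the category of one detected obstacle patch."""
    return clf.predict([signature])[0]
```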

The proposed technique requires a reduced computational power because we are not performing an exhaustive sliding window search within the current frame in order to determine objects and their associated positions.

In our case, the obstacle classification receives as input the location and size of the object that we want to label, determined with the help of the object detection approach. The various elements detected and recognized are finally transmitted and presented to the VI user, with the help of an acoustic feedback.

The proposed object detection and classification framework strongly relies on a motion analysis process. However, it cannot handle the detection and classification of objects of interest that cannot be distinguished from the background on the basis of motion analysis alone. Such objects are nevertheless interesting to consider and exploit, notably in two scenarios. The first one concerns the recognition of landmarks/buildings in outdoor environments, which can improve/complement the GPS localization abilities. The second one concerns the identification of user-defined objects of interest in indoor environments.

Let us note that in such less critical scenarios, the real-time condition can be slightly relaxed, as long as the system ensures interactive response rates. Thus, a latency of several seconds can be admitted for both landmark recognition and indoor object detection. For this reason, we have adopted a different methodological framework, based on interest point representations, which makes it possible to enrich the capabilities of the system.

The proposed solutions are presented in the following sections.

3 Landmark/building recognition

Although geo-localization solutions have been largely improved in recent years, orientation in dense urban environments is still a complicated task. Since building façades provide key information in this context, the visual approach has become a suitable alternative. Our prime goal is to design a mobile application for automatically identifying buildings in a city.

Numerous attempts have been made at solving the building recognition problem. Some of the recent approaches [4] use textured 3D city models. However, such 3D information is not always available for all cities. Moreover, memory requirements exclude the possibility of a mobile phone implementation. Gronàt et al. [30] suggest treating the place recognition problem as a classification task. They use the available GPS tags to train a classifier for each location in the database, in a manner similar to per-exemplar SVMs in object recognition. The approach is efficient only when a large dataset and a large-sized vocabulary are used. On the contrary, in our case, we propose a minimum number of images per class along with a very small-sized vocabulary. Most of the approaches presented in the literature are based on feature representations. In the work of Chen et al. [11], a large-sized vocabulary tree is used and a tf-idf indexing algorithm scores a reduced number of GPS-tagged images. Many similar approaches use a GPS-tag-based pruning method along with custom feature representations. In [2], a rapid window detection and localization method is proposed, where building detection is considered as a pattern recognition task. In our work, we address this task as a classification problem.

Our contribution is related to the way that a small-sized vocabulary tree is used, combined with the spatial verification step. Instead of matching a query descriptor to its nearest visual word, we perform the matching with respect to a reduced list of interest points extracted from an image in the dataset. This approach makes it possible to overcome the dimensionality problem, which is critical when considering mobile implementations.

3.1 The training phase

First, all the images in the training dataset are used to build a visual word vocabulary, determined with the help of a k-means clustering algorithm [3] applied to the whole set of SIFT descriptors [44] extracted from the training images. Each element in the vocabulary is considered as a visual word, represented by a vector of length 128 (i.e., the length of a SIFT descriptor). The size of the vocabulary was set to 4000 clusters. Then, for each labeled image in the dataset, we extract Difference of Gaussians (DoG) [43] interest points and save their corresponding SIFT descriptors. We then divide these vectors into relatively small groups, each group consisting of the list of interest points assigned to a given visual word from the vocabulary. The FLANN matcher [49] is used to determine the nearest neighbour (from the considered vocabulary) for each interest point. In order to reduce quantization errors, we propose to include the same interest point in the groups of its 3 nearest-neighbour visual words. Next, we save the k_landmark spatial neighbours of each interest point.

After tests with different values of k_landmark, we set k_landmark to 20, which offers a good trade-off between computational cost and robustness. In fact, we found that for higher values of k_landmark the performance remains consistent while the execution time increases exponentially. However, for values of k_landmark lower than 20, the number of incorrect results becomes penalizing. In addition, for each interest point we select the 20 spatially nearest points and store their corresponding visual words.

At the end of the training phase, the following 5 components are recorded: the vocabulary file including the visual word vectors, the descriptors of all the interest points in the training set, the descriptor clusters grouped by visual word, the spatial neighbours of each interest point and, finally, the class name of each interest point.
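A condensed Python/OpenCV sketch of this offline stage is given below. It follows the structure and parameter values described above (4000-word vocabulary, soft assignment to the 3 nearest words, k_landmark = 20 spatial neighbours), but the data layout and function name are ours, and a real implementation would also persist the five components to disk.

import cv2
import numpy as np

def build_landmark_index(train_images, labels, vocab_size=4000, k_landmark=20):
    # Offline training sketch for Section 3.1: vocabulary, per-word descriptor
    # groups, spatial neighbours and class labels (the five stored components).
    sift = cv2.SIFT_create()
    per_image = []
    for img, label in zip(train_images, labels):
        kps, desc = sift.detectAndCompute(img, None)
        if desc is not None:
            per_image.append((kps, desc, label))
    all_desc = np.vstack([d for _, d, _ in per_image]).astype(np.float32)

    # Visual vocabulary via k-means (4000 words in our configuration).
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 1e-3)
    _, _, vocab = cv2.kmeans(all_desc, vocab_size, None, criteria, 3,
                             cv2.KMEANS_PP_CENTERS)

    # FLANN (kd-tree) matcher used to assign each descriptor to visual words.
    flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 32})
    groups = {w: [] for w in range(vocab_size)}     # word -> [(descriptor, label)]
    spatial_neighbours, point_labels = [], []
    for kps, desc, label in per_image:
        matches = flann.knnMatch(desc.astype(np.float32), vocab, k=3)
        pts = np.array([kp.pt for kp in kps])
        for i, three_nearest in enumerate(matches):
            for m in three_nearest:                 # soft assignment to 3 words
                groups[m.trainIdx].append((desc[i], label))
            # k_landmark spatially nearest key points of the same image
            order = np.argsort(np.linalg.norm(pts - pts[i], axis=1))
            spatial_neighbours.append(order[1:k_landmark + 1])
            point_labels.append(label)
    return vocab, groups, spatial_neighbours, point_labels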

3.2 The test phase

In the test phase (Fig. 9), as an initialization step we build a kd-tree from the list of visual words in the vocabulary using the FLANN approach.
Fig. 9

Descriptor matching and spatial consistency scheme. Stars represent interest points and squares are visual words from the vocabulary

For each query image, we extract DoG interest points and their corresponding SIFT descriptors. The kd-tree is used as a first step to find the corresponding visual word for each interest point. Then, a BruteForce matcher is applied in order to determine the nearest descriptor within the cluster of descriptors associated with the same visual word. The BruteForce matcher computes the Euclidean distance between the query descriptor and each descriptor in the cluster.

This two-phase matching method allows us to benefit from both the FLANN matcher's fast search and the BruteForce matcher's consistency. In fact, the BruteForce search becomes computationally expensive only if the number of comparisons is too high.

In our case, the average number of descriptors in a given cluster is around 250, which ensures a relatively low computational cost.
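The two-phase search can be summarized by the sketch below, where vocab_matcher is assumed to be a cv2.FlannBasedMatcher already loaded (via add/train) with the vocabulary, and clusters maps each visual word to its list of (descriptor, class label) pairs built during training. Names and data structures are illustrative, not the authors' code.

import cv2
import numpy as np

def two_phase_match(query_desc, vocab_matcher, clusters):
    # Phase 1: FLANN kd-tree over the vocabulary picks the nearest visual word.
    query = query_desc.reshape(1, -1).astype(np.float32)
    word = vocab_matcher.knnMatch(query, k=1)[0][0].trainIdx
    # Phase 2: brute-force Euclidean search restricted to that word's cluster
    # (about 250 descriptors on average, hence a low cost).
    best_label, best_dist = None, np.inf
    for desc, label in clusters.get(word, []):
        d = float(np.linalg.norm(query_desc - desc))
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label, best_dist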

3.3 Spatial consistency check

To avoid mismatches, a spatial consistency algorithm is further proposed. The principle consists of evaluating the visual words associated with the k_NN-SC nearest spatial neighbours of each query interest point. Since we have already computed the corresponding visual word for each interest point in an earlier step, we check whether, among these visual words, there exists a minimum number of similarities with the spatial neighbours of the matched interest point. The same number of spatial neighbours as in the training phase is used here. A candidate point is considered spatially consistent if it has more than k_NN-SC/4 similarities. Each interest point that fails the spatial consistency test is considered irrelevant. Otherwise, we vote for its corresponding class label.

As a final step, a class histogram collecting the different scores obtained is constructed. Assuming that an image can be labeled with at most one class, we require the confidence measure of the top-ranked class to be greater than a predefined threshold (Th_top-ranked). This measure is defined as the ratio between the best score and the number of keypoints in the image. If the top-ranked class has a confidence measure lower than the fixed threshold, we assume that none of the known classes is present in the query image and a negative label is returned. Otherwise, the label of the class with the best score is returned. In our experiments, threshold values between 5 and 15 % yield relatively stable results; we have therefore selected the value of 10 % for the results reported in this paper.
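The voting and thresholding logic of this section can be sketched as follows; the container formats are assumptions of ours, and k_nn_sc and th_top_ranked correspond to the k_NN-SC neighbourhood size and the 10 % confidence threshold discussed above.

def vote_with_spatial_check(query_words, query_neighbours, matched_points,
                            k_nn_sc=20, th_top_ranked=0.10):
    # Voting sketch for the spatial consistency test and class histogram.
    # query_words[i]      : visual word of the i-th query key point
    # query_neighbours[i] : indices of its k_nn_sc spatially nearest query points
    # matched_points[i]   : (class_label, neighbour_words_of_matched_training_point)
    #                       or None when no match was found.
    votes = {}
    for i, match in enumerate(matched_points):
        if match is None:
            continue
        label, train_neighbour_words = match
        # Visual words found around the query point vs. around the matched point.
        neighbour_words = {query_words[j] for j in query_neighbours[i]}
        similarities = len(neighbour_words & set(train_neighbour_words))
        if similarities > k_nn_sc / 4:              # spatially consistent point
            votes[label] = votes.get(label, 0) + 1
    if not votes:
        return None
    best_label, best_score = max(votes.items(), key=lambda kv: kv[1])
    confidence = best_score / max(len(query_words), 1)
    return best_label if confidence > th_top_ranked else None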

The proposed approach has been fully implemented on an Android smartphone, without the need for any client-server communication: the application can run even without an internet connection. Let us underline that the training phase is performed off-line, on a regular server. Once the training phase is completed, the resulting vocabulary and corresponding descriptors are stored on the smartphone. In this case, we have to limit the number of possible building categories to about 10–15 landmarks. This is however sufficient to deal with a given itinerary, and the query response time is about 2 s.

Let us now describe how such an interest point-based method can be extended/adapted for the detection, localization and tracking of objects of interest in indoor environments.

4 Indoor object of interest detection/localization

Multi-object detection and tracking requires both efficient object recognition and efficient localization approaches. Sliding window techniques seem to be well-suited for such a task, specifically for classification and localization purposes. Recently, branch-and-bound approaches [37] have been introduced to speed up object recognition and localization. Such techniques rely on a linear scan of all the models, which can be computationally costly when the number of models is large. Other recent methods abandon the sliding window paradigm in favor of segmentation-based approaches, the principle of which consists of using multiple types of over-segmentations to generate object candidates [27, 68]. Despite the promising results reported, the related limitation comes from the relatively high computational burden involved, which makes such methods inappropriate for mobile applications.

Unlike classification and categorization problems, which need to learn, for each considered category, a set of objects with a certain amount of variability, our recognition and localization framework is designed to recognize single object models specified by the user. For this purpose, an alternative to sliding windows and segmentation consists of extracting local interest points and grouping them into pertinent regions. Such methods critically rely on the matching algorithm involved, which establishes correspondences between feature points in the given test image and those present in the training data set. The main advantage is related to the low computational requirements involved.

In our work, we have adopted a feature point matching approach in order to design a simultaneous multi-object recognition and localization framework. The main contribution concerns the reliable matching algorithm proposed: based on an efficient spatial layout consistency test, the method achieves significant savings in computational time.

The recognition approach is preceded by an offline stage designed to learn a set of pre-defined objects specified by the user. Local sparse feature points are extracted from a set of training images covering all the objects from different views and at various scales. Local interest points are detected using the Difference of Gaussians (DoG) detector [43], and each local region is described with the help of SIFT descriptors [44]. For further processing and less expensive computation, we assign a visual word to each descriptor. To this end, we build a vocabulary of visual words using the k-means clustering algorithm [3], as described in Section 3.

The same procedure is applied to extract local interest points from the given video frame. The interest points are put into correspondence with the feature points obtained from the set of training images. A local interest point is classified as belonging to an object instance if it is matched with a feature point of that object in the training set. In order to boost the performance of the method, we introduce a novel matching algorithm. The proposed procedure is based on the verification of the spatial consistency of the matched interest point pairs (cf. Section 3.3).

After classifying the interest points, the next stage consists of grouping feature points belonging to the same object class into spatial clusters, with the help of hierarchical clustering. Each time a spatial cluster exceeds a pre-defined number of points, it is detected as a new object. A bounding box covering the spatial cluster defines the localization of the object in the test image. The main steps of the object recognition and localization process in the test stage are illustrated in Fig. 10. Let us now detail the various stages involved.
Fig. 10

Overview of the object recognition/localization approach

4.1 Interest point matching

4.1.1 Nearest neighbor searching

Our matching method aims at determining, for each interest point in the test image, its correspondents in the set of training images. To enable fast matching, we search for the k nearest visual words of each local interest point. The feature points in the training set associated with these k nearest visual words are then identified and stored. This preliminary, rough matching based on visual words makes it possible to decrease the number of candidate matches and thus to significantly reduce the computational complexity. The corresponding descriptors are then matched with that of the considered interest point, and the closest m matches are retained. Furthermore, in order to improve the reliability of the matching procedure, we investigate the spatial layout similarity between the regions around the matched feature point pairs.

4.1.2 Spatial layout consistency

The proposed spatial layout consistency procedure extends the method previously introduced in Section 3.3. The availability of interest points and of associated images in the training set makes it possible to perform a finer analysis, by taking into consideration the angles between matched interest points. For each interest point, we define a search area given by its r spatially nearest neighbour interest points. The interest point to be classified is called the central point, while its spatial nearest neighbours are called secondary interest points. Let us consider a central match pair (A, B). The secondary feature points that are assigned to the same visual word are considered as matched. We accept a secondary match pair by examining its relative position with respect to the SIFT orientations of the central interest points. Let us recall that each SIFT keypoint patch has an assigned orientation reflecting the dominant gradient direction. A correct secondary match pair (\( A_1 \), \( B_1 \)) should satisfy the following condition:
$$ \left|\left(\overrightarrow{O_A},\overrightarrow{A{A}_1}\right)-\left(\overrightarrow{O_B},\overrightarrow{B{B}_1}\right)\right|<{\theta}_{th} $$
(12)
Here, the central match pair is (A, B) and the secondary match pair is (\( A_1 \), \( B_1 \)), while \( \overrightarrow{O_A} \) and \( \overrightarrow{O_B} \) respectively represent the SIFT orientations of the central interest points A and B. The parameter \( \theta_{th} \) is a pre-defined threshold (set in our experiments to 30°). Figure 11 illustrates a correct and a rejected secondary match pair.
Fig. 11

Illustration of spatial consistency condition: (\( {A}_1 \), \( {B}_1 \)) is a correct secondary match pair since the condition \( \left|\left(\overrightarrow{O_A},\overrightarrow{A{A}_1}\right)-\left(\overrightarrow{O_B},\overrightarrow{B{B}_1}\right)\right|<{\theta}_{th} \) is verified. The same condition is not satisfied for the secondary match pair (\( {A}_2 \), \( {B}_2 \)) which is rejected

If a correct match satisfying the spatial layout consistency condition is identified, we label the interest point in the test image with the class of the matched interest point from the training dataset. Otherwise, the interest point in the test image remains unlabeled.
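For clarity, the angular test of Eq. (12) can be written as the following small Python function; points are 2D pixel coordinates and the orientations are the SIFT keypoint angles in degrees (cv2.KeyPoint.angle). This is an illustrative reading of the condition, not the authors' code.

import numpy as np

def secondary_pair_consistent(A, A1, OA_deg, B, B1, OB_deg, theta_th=30.0):
    # Angular test of Eq. (12): compare the position of the secondary points
    # A1 and B1 relative to the SIFT orientations of the central points A and B.
    def relative_angle(orientation_deg, p, q):
        v = np.asarray(q, dtype=float) - np.asarray(p, dtype=float)
        ang = np.degrees(np.arctan2(v[1], v[0])) - orientation_deg
        return (ang + 180.0) % 360.0 - 180.0        # wrap to (-180, 180]
    diff = relative_angle(OA_deg, A, A1) - relative_angle(OB_deg, B, B1)
    diff = abs((diff + 180.0) % 360.0 - 180.0)      # smallest angular difference
    return diff < theta_th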

4.2 Spatial clustering

The spatial clustering process is performed separately for each pre-defined object model. The purpose of this stage is to examine the locations of the same labeled interest points in order to detect the presence of an object and identify its localization. This operation includes two main stages.

4.2.1 Building spatial clusters

We begin by grouping identically labeled feature points into spatial clusters with the help of hierarchical clustering. The proposed algorithm initializes each point as a spatial cluster. Two spatial clusters with a distance below a pre-defined threshold are then merged. The distance between two spatial clusters \( C_1 \) and \( C_2 \) is defined in Eq. (13), where \( P_1 \) and \( P_2 \) are two interest points and d is the Euclidean distance between two points:
$$ d\left({C}_1,{C}_2\right)=\underset{{P}_1\in {C}_1,\;{P}_2\in {C}_2}{ \min }\;d\left({P}_1,{P}_2\right) $$
(13)

This process is repeated iteratively and stops when all the distances between clusters exceed the pre-defined threshold. Spatial clusters with more than a pre-defined number of points are finally retained.
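A minimal single-linkage implementation of this clustering step is sketched below; the distance and minimum-size thresholds are placeholders, as the paper does not fix their values here.

import numpy as np

def spatial_clusters(points, dist_th=40.0, min_pts=5):
    # Single-linkage grouping for Eq. (13): start with one cluster per point and
    # merge clusters whose minimum inter-point distance is below dist_th;
    # keep only clusters with at least min_pts points.
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]
    merged = True
    while merged:
        merged = False
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if d < dist_th:
                    clusters[a] += clusters[b]
                    del clusters[b]
                    merged = True
                    break
            if merged:
                break
    return [c for c in clusters if len(c) >= min_pts]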

4.2.2 Object localization

In some cases, more than one spatial cluster associated with an object model can be identified. Under the assumption that only one instance of an object model can be detected in the test image, two situations can occur. In the first case, the spatial clusters represent different parts of the same object and should be merged to define its accurate localization. In the second case, an off-target spatial cluster should be discarded. To tackle this issue, we test all possible combinations generated by merging the spatial clusters of an object class. Each combination represents a candidate window for the object's localization. The purpose of the following process is to identify the candidate window that best covers the localization of the object.

Let us define the measure of similarity S between two images \( I_1 \) and \( I_2 \) as the cosine measure between the vectors \( V_1 \) and \( V_2 \) of their Bag of Words histograms:
$$ S\left({I}_1,{I}_2\right)=\frac{{V}_1^{T}{V}_2}{\left\Vert {V}_1\right\Vert \left\Vert {V}_2\right\Vert } $$
(14)
We compute the similarity measure S between a candidate window W, covering a combination of spatial clusters in the test image, and each image of the training set T(C) associated with the target object class C. The score assigned to a candidate window is defined as follows:
$$ Score\left(W\right)=\underset{I\in T\left(C\right)}{ \max }S\left(W,I\right) $$
(15)

The candidate window with the highest score defines the bounding box localization of the detected object.
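The window-selection step of Eqs. (14)–(15) then reduces to the following enumeration; bow_of_window is an assumed helper that crops the test image around a subset of clusters and encodes it as a Bag of Words histogram, and train_bows holds the histograms of the training images of the class.

import numpy as np
from itertools import combinations

def cosine_similarity(v1, v2):
    # Eq. (14): cosine measure between two Bag of Words histograms.
    n = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2)) / n if n > 0 else 0.0

def best_window(clusters, bow_of_window, train_bows):
    # Eq. (15): enumerate merges of the spatial clusters of one object class,
    # score every candidate window against the class training images and keep
    # the highest-scoring bounding box.
    best_box, best_score = None, -1.0
    for r in range(1, len(clusters) + 1):
        for subset in combinations(clusters, r):
            window, bow = bow_of_window(subset)    # assumed helper (crop + encode)
            score = max(cosine_similarity(bow, t) for t in train_bows)
            if score > best_score:
                best_score, best_box = score, window
    return best_box, best_score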

The final stage of the method concerns a tracking procedure, which makes it possible to consistently follow the detected objects along multiple frames. The same method as the one described in Section 2.1 was employed here.

The object localization information is sent to the VI user through acoustic feedback. The warning messages are sent in the same order as the objects are detected, with no priorities. When multiple objects of interest are identified in the scene, only one warning at a time is generated in order not to confuse the user. After 2 s, if another unannounced object is still present in the scene, a new warning is launched. Let us note that a speech recognition module can be easily integrated in the system in order to allow the VI user to specify the desired object of interest.
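The pacing of the audio warnings described above can be illustrated with the following scheduler sketch; the speak callback stands in for the actual text-to-speech back end, which is not detailed here.

import time

class WarningScheduler:
    # Illustrative pacing of the audio feedback: announce one detected object at
    # a time and wait 2 s before announcing the next unannounced object.
    def __init__(self, spacing_s=2.0):
        self.spacing_s = spacing_s
        self.last_time = 0.0
        self.announced = set()

    def update(self, detected_labels, speak=print):
        # speak() stands in for the actual text-to-speech back end.
        now = time.time()
        if now - self.last_time < self.spacing_s:
            return
        for label in detected_labels:               # detection order, no priorities
            if label not in self.announced:
                speak("Detected: " + label)
                self.announced.add(label)
                self.last_time = now
                break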

5 Experimental results

Let us first present the experimental evaluation of the object detection / classification framework presented in Section 2.1.

5.1 Object/obstacle detection and classification

We tested our system in multiple complex outdoor urban environments with the help of visually impaired users. The videos were also recorded and used to build a testing database of 30 video sequences. The average duration of each video is around 10 min, acquired at 30 fps, at an image resolution of 320 × 240 pixels.

The image sequences are highly challenging because they contain in the same scene multiple static and dynamic obstacles including vehicles, pedestrians or bicycles.

Moreover, because the recording was performed by VI users, the videos are shaky and noisy, and include dark, cluttered and dynamic scenes. In addition, different types of camera/background motion are present.

The annotation of each video was performed frame by frame by a set of human observers. Once the ground truth data set was built, we objectively evaluated the proposed methodology with the help of two error parameters, denoted by FD and MD, representing respectively the number of falsely detected (FD) and missed (MD) obstacles. Let us denote by D the total number of correctly detected obstacles. Based on these quantities, the most relevant evaluation metrics are the so-called recall (R) and precision (P) rates [53].
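Using these counters, recall and precision follow the standard definitions of [53]:

$$ R=\frac{D}{D+MD}, \qquad P=\frac{D}{D+FD} $$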

The recall and precision rates can be combined in order to define a unique evaluation measure, the so-called F1 score [53], computed as F1 = 2PR/(P + R). Table 1 summarizes the results obtained by the obstacle detection module.
Table 1

Experimental evaluation of the obstacle detection module

 

Category      No. Obj.   D     MD    FD    R (%)   P (%)   F1 (%)
Cars          851        778   73    58    91.4    93.1    92.2
People        678        584   94    87    86.1    87.0    86.5
Bikes         315        262   53    43    83.2    85.9    84.5
Static Obs.   587        511   76    62    87.1    89.2    88.1

On the considered database, the resulting F1 scores are above 84 % for all the considered categories. Particularly high detection rates are obtained for the cars and static obstacles classes.

Let us note that the image resolution is an important parameter, directly affecting the detection performance and conditioning the real-time processing constraint. In our case, the retained image resolution was selected as a trade-off between the detection rates and the processing speed. When increasing the video resolution (1280 × 780 pixels), even low-contrast or small objects can be identified, but the computational burden becomes prohibitively high. On the contrary, when reducing the image size (176 × 144 pixels), various artifacts appear, caused mostly by camera or object motion; the motion vectors associated with low-resolution objects are reduced or have low amplitude values. Thus, we have finally considered an image resolution of 320 × 240 pixels which, on the considered devices, represents the maximal resolution that still enables real-time performance.

We compared our obstacle detection method with one of the most relevant algorithms in the state of the art, introduced in [50]. On the considered data set, the method in [50] yields an average F1-score of 91 %. However, even though the detection performance increases, the computational time becomes prohibitive (more than 60 s/frame), which makes it unsuited for a real-time scenario.

In Fig. 12 we give a graphical representation of the performance of the obstacle detection module. For each considered video we present four representative images. The various static or dynamic objects present in the scene are represented with different colors (associated with each individual motion class). Due to our temporal consistency step, the motion class associated with an object remains constant between successive frames.
Fig. 12

Experimental results of the obstacle detection and classification framework using ALICE device

From the videos presented in Fig. 12 we can observe that the proposed framework is able to detect and classify, with high accuracy, both dynamic obstacles (e.g. vehicles, pedestrians and bikes) and static obstacles (e.g. pillars, road signs and fences). In the case of an obstruction, our system is able to identify whether the object is situated at head level or down at foot level.

Regarding the sensitivity of the detection method, as can be observed in Fig. 12, a high density of obstacles in the near surroundings of the user does not influence the system performance. At the same time, the system shows robustness to abrupt changes in illumination conditions.

In the second part, we have evaluated the performances of the obstacle classification module. We conducted multiple tests on a set of 2432 image patches that were extracted from the video database using our obstacle detection method.

There are five parameters involved in our object recognition framework: the maximum number of extracted relevant interest points (N_points), the number of cells (n_cell) of the A-HOG descriptor, the size of the codebook used in the BoVW representation, the size of the codebook used in the VLAD representation, and the gamma parameter (γ) of the SVM-RBF kernel.

In Fig. 13a, a comparative diagram of the results is presented with respect to the variation of the total number of interest points retained after applying our filtering strategy (Section 2.2.1). Comparable results, in terms of F1 score, are obtained for values ranging between 300 and 2000 interest points. In Fig. 13b we illustrate the F1 variation with respect to the total number of cells used to extract the adaptive HOG descriptor. Increasing the number of cells can be interpreted as a zoom effect over the image patch, so the discriminative power of the descriptor is reduced in this case.
Fig. 13

System performance evaluation with respect to the parameters involved: a. maximum number of interest points, b. maximum number of cell for A-HOG; c. codebook size used in BoVW; d. codebook size used in VLAD; e. γ parameter of the SVM-RBF kernel

For further experiments we set N_points to 300 interest points and n_cell to 128, which offers the best compromise between classification accuracy and computational speed.

We have studied next the impact of the vocabulary size on the overall system performance.

In Fig. 13c and d we present the F1 score variation for different sizes of the BoVW and VLAD vocabularies.

As can be observed, in the case of BoVW a vocabulary of 4000 words returns the best results, while for VLAD the best results are obtained for a codebook of 512 words. However, we have to recall that in the context of VI applications, our objective is to achieve real-time processing.

The classification speed is in this respect a crucial parameter. As the vocabulary size increases, the computational complexity increases significantly. Due to this constraint, we adopted a BoVW vocabulary of 1000 words, while for VLAD we set the codebook size to 128 words.

In Fig. 13e we show the system performance when varying γ in the SVM-RBF kernel. As expected, different optimal values are obtained for BoVW and VLAD representations. We fixed γ to 50 when adopting the BoVW representation and set γ to 1 in the case of VLAD.

Based on these results, our next goal is to determine the best mix of methods that return high classification rates, in the context of VI application, without extensively increasing the computational time.

The obtained results are summarized in Table 2. Regarding the low-level image descriptors, it is better to use relevant interest points rather than A-HOG, because they return high classification rates without excessively increasing the computational time.
Table 2

Experimental evaluation of the different systems considered

Selected framework                        F1 score (%)   Processing time (ms)
A-HOG + BoVW + SVM                        78.9           41
A-HOG + VLAD + SVM                        82.9           157
Relevant interest points + BoVW + SVM     83.1           55
Relevant interest points + VLAD + SVM     86.6           140

The best results are an F1 score of 86.6 % (relevant interest points + VLAD + SVM) and a processing time of 41 ms (A-HOG + BoVW + SVM).

For the global image representation, even though the BoVW approach can return good results for large vocabularies, it cannot scale to more than a few thousand images on a regular smartphone. The VLAD representation is more discriminative and describes images using typically smaller and denser vectors. Beyond the optimized vector representation, high retrieval accuracy and significantly better results are obtained when compared to classical BoVW, even for small codes consisting of a few hundred bits.
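For reference, a minimal VLAD encoder consistent with this description (nearest-center residual accumulation followed by power- and L2-normalization, as in [18]) can be sketched as follows; the implementation details of our actual system may differ.

import numpy as np

def vlad_encode(descriptors, codebook):
    # Minimal VLAD encoding: accumulate residuals to the nearest visual word,
    # then apply power- and L2-normalization (K = 128 words in our setting).
    K, d = codebook.shape
    vlad = np.zeros((K, d), dtype=np.float32)
    if descriptors is not None and len(descriptors) > 0:
        dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
        nearest = dists.argmin(axis=1)
        for desc, k in zip(descriptors, nearest):
            vlad[k] += desc - codebook[k]           # residual accumulation
    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))    # power normalization
    norm = np.linalg.norm(vlad)
    return vlad / norm if norm > 0 else vlad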

However, regarding the computational burden, the VLAD image representation significantly increases the processing time. A solution to this problem is to reduce the size of the vocabulary from 128 to 64 words.

From the perspective of an application dedicated to VI people, we believe that the optimal solution is the selection of relevant interest points, VLAD and SVM because it offers the best compromise between the performance and processing speed.

In Table 3 we illustrate the performance of the classification system (using relevant interest points, with VLAD image representation and SVM-RBF classification) for each category (i.e. vehicles, bicycles, pedestrians and static obstacles), along with the confusion matrix. As can be noticed, we introduced an extra category called Outliers in order to make sure that our system assigns an image patch to a category because of its high resemblance to that class, and not simply because a label must be assigned.
Table 3

Obstacle classification module performance evaluation: confusion matrix and MC/FC per category

 

                  Static obstacles   People   Bikes   Cars   Outliers   GT    MC   FC   Precision   Recall   F1 score
Static obstacles  108                8        4       2      5          127   7    19   0.939       0.851    0.89
People            4                  355      8       3      7          377   17   22   0.954       0.941    0.94
Bikes             10                 8        88      3      6          115   9    27   0.907       0.765    0.83
Cars              3                  0        3       297    2          305   13   8    0.958       0.973    0.96

GT Ground Truth, MC Missed Classified, FC False Classified (False Alarms)

The obtained results show high F1 scores, which validates our approach. The recognition scores are particularly high, above 83 % in all cases. Slightly lower performances are obtained for the bicycle class. This is mainly due to the fact that the part of the image corresponding to the bike itself is quite small compared to the entire detected object (e.g., the person riding the bike plus the bike). Thus, confusion appears between the bike and pedestrian categories (as also shown in Table 3).

Let us note that the classification performances can be enhanced by considering a late fusion approach [24]. In our case, multiple fusion strategies can be considered by combining both multiple cues of information (i.e., relevant interest points descriptors and adapted HOG) and results from various classifiers (i.e., BoVW and VLAD). As indicated in [24], we can expect that the late fusion classification will outperform the results of the best individual descriptors and classifiers. The late fusion process itself is of low complexity. However, it still requires computing the results of different approaches. This additional computational burden remains the main limitation of such an approach in a real-time framework.

In terms of computational complexity, the average processing time of the entire framework (obstacle detection and classification) when run on a regular Android smartphone is around 240 ms per frame, which leads to a processing speed around 5 frames per second.

Regarding battery consumption, our system can run continuously for 1–1.5 h. We estimate that, by equipping the ALICE system with general-public external batteries, the system's autonomy can be easily extended up to 4–5 h of continuous running.

Discussion

One limitation of the obstacle detection framework is its failure to identify large, flat structures (e.g., walls, doors). When the user progressively approaches a large obstacle and its size becomes larger than half of the video scene, the system is no longer able to correctly distinguish the background information from the foreground objects. In this case, the obstacle is considered as part of the background. Moreover, because the algorithm exploits an LKA point tracking procedure, aliasing problems can be expected to occur in such cases.

The only solution for dealing with such structures would be to consider specific, dedicated detection algorithms. The extraction of vertical, flat and/or repetitive elements, such as the method presented in [65], can provide hints for building such a solution.

The evaluation of the standalone building/ landmark recognition approach is presented in the following section.

5.2 Landmark/building recognition

To objectively evaluate our approach, we used two publicly available datasets: the Zürich Building dataset [62] and the Sheffield Building dataset [41]. The Zurich Building Database (ZuBuD) includes 1005 database images with a resolution of 640 × 480 pixels, representing 5 views of 201 different building facades in Zurich (Fig. 14). For testing purposes, we have considered the 115 query images indicated in [20, 41, 72]. Our method achieves a 99.13 % accuracy rate on this dataset, which outperforms the state-of-the-art approaches designed for mobile applications. In addition, we can observe that despite significant occlusions, the method is able to correctly identify the considered building.
Fig. 14

An example from the Zurich Building dataset. a and c represent the five labeled images used for the training phase. b and d represent two test samples that were successfully matched to the correct building class. Blue interest points were assigned to the correct class, while red interest points were assigned to a wrong class

For the Sheffield dataset (Fig. 15), no labeled test images are available, so we run a 5-fold cross-validation scheme in which, for each fold, we randomly select one fifth of the dataset images as test samples and the rest as training samples (Table 4).
Fig. 15

An example from Sheffield Building dataset. a and c five labeled images per building used for the training phase. b and d two test samples that were successfully matched to the correct class of buildings. The blue interest points were labeled to the correct class while the red interest points were labeled to the wrong class

Table 4

Five-fold cross-validation results

 

                           ZuBuD     Sheffield
1st round accuracy rate    99.00 %   99.63 %
2nd round accuracy rate    99.00 %   99.87 %
3rd round accuracy rate    99.00 %   99.51 %
4th round accuracy rate    99.50 %   99.87 %
5th round accuracy rate    94.02 %   99.87 %
Average accuracy rate      98.00 %   99.75 %

This dataset consists of 3192 JPEG images with a resolution of 160 × 120 pixels, covering 40 buildings in Sheffield. We applied the same evaluation scheme to the Zürich dataset as well. The average accuracy rate obtained in this case is 98 %.

We compared our results on the 115 query images from the ZuBuD dataset with those reported in other papers. The reported recognition rates range from 80 % in [41] to 96.5 % in [72] and 99.1 % in [20].

The recognition rate of our method is 99.13 %. The only method that exceeds this rate is the one presented in [47], where a 100 % accuracy rate is reported. Their algorithm consists of detecting Maximally Stable Extremal Regions in the images and then describing them by affine-covariant local coordinate systems (called Local Affine Frames, LAFs). A keypoint matching scheme is then performed between the query image and the whole dataset. Obviously, such a method requires significant computational and memory resources and is therefore not well-suited for mobile phone implementations. Moreover, the recognition rate reported in [47] represents the percentage of finding a correct image match in a top-five ranked list of candidates, whereas in our case we report the percentage of finding exactly the correct building class.

To the best of our knowledge, only the paper in [74] reports significant results on the Sheffield building dataset. Zhao et al. [74] extract multi-scale GIST (MS-GIST) features that represent the structural information of the building images and use an enhanced fuzzy local maximal marginal embedding (EFLMME) algorithm to project the MS-GIST feature manifold onto a low-dimensional subspace. The method achieves a maximum recognition rate of 96.90 % on a manually selected subset of the dataset.

Finally, let us present the results obtained for the indoor object localization/identification and tracking approach introduced in Section 4.

5.3 Indoor object identification and tracking

In order to constitute a ground truth data set, we have retained 40 book covers downloaded from the Stanford mobile visual search data set [9]. For each book cover, we have acquired 20 training images (at a resolution of 720 × 480 pixels) representing the object at different poses and scales.

In addition, we have constituted our own data set, composed of 10 target objects commonly used in daily life (books, bottles, remote controls, keyboards). Figure 16 shows some examples of the images used in our experiments. Here again, 20 training images have been acquired, representing each object from different views and at various scales, in order to learn the object model.
Fig. 16

Examples of training images used in experiments

Starting from these objects, we have constructed a test data set composed of 50 images with a resolution of 720 × 480 pixels. Unlike the Stanford mobile visual search dataset, the images captured for evaluation contain multiple objects of interest, placed against a significantly cluttered background. In addition, target objects can cover less than 10 % of the full image.

The test data set contains a total of 142 target objects. A correct recognition and localization of an object of interest is obtained when the ground truth bounding box covers more than half of the detected bounding box. The recall and precision rates obtained are 92 and 93 %, respectively. This demonstrates the pertinence of the proposed approach for efficient object detection and recognition.
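Our reading of this evaluation rule corresponds to the small check below; the box convention and the helper name are ours.

def detection_is_correct(gt_box, det_box):
    # Evaluation rule used above: a detection counts as correct when the ground
    # truth box covers more than half of the detected box. Boxes are
    # (x1, y1, x2, y2) tuples in pixel coordinates.
    ix1, iy1 = max(gt_box[0], det_box[0]), max(gt_box[1], det_box[1])
    ix2, iy2 = min(gt_box[2], det_box[2]), min(gt_box[3], det_box[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    det_area = (det_box[2] - det_box[0]) * (det_box[3] - det_box[1])
    return det_area > 0 and inter > 0.5 * det_area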

Figure 17 shows some examples of the objects detected and tracked in cluttered environments.
Fig. 17

Examples of multi-object localization results

We have run our experiments on a mobile device. The recognition and localization approach takes about 2 s per query. The CPU-only SIFT feature detection and description stage, which alone takes about 1.5 s, is by far the most time-consuming part of the process. A GPU implementation [55] could be a future solution for significantly reducing the recognition execution time. The recognition algorithm is executed at regular frame intervals. For the tracking process, the running time is estimated at 60 ms per frame, which corresponds to a frame rate of about 16 fps.

6 Conclusions and perspectives

In this paper, we introduce a novel computer vision-based perception system, dedicated to the autonomous navigation of visually impaired people. The core of the proposed framework is the obstacle detection and classification methodology, designed to identify, in real time, both static obstacles and dynamic objects without any prior knowledge about the object type, position or location. The acoustic feedback is finally transmitted to the VI user through bone conduction headphones. At the hardware level, the entire framework is embedded on a regular smartphone attached to the VI user with a chest-mounted harness.

In addition, a landmark/building recognition approach has been proposed, which aims at improving the GPS-based navigation capabilities with the help of visual analysis.

Finally, an approach for the localization and tracking of user-defined objects of interest has been proposed.

We tested our method in different outdoor scenarios with visually impaired participants. The system shows robustness and consistency even under significant camera and background motion, and in crowded scenes with multiple obstacles.

From the experimental evaluation we determined that our system works in real time, returning warning messages fast enough for the user to walk normally. The algorithms were carefully designed and optimized in order to work efficiently on a low-power processing unit. In our opinion, this characteristic is one of the major differences between our system and most state-of-the-art algorithms. The framework is completely integrated on a regular smartphone and can thus be described as a wearable and user-friendly device, ready to be used by the VI. The system is low-cost, since it does not require any expensive, dedicated hardware architecture, but solely general-public components available at affordable prices on the market. By using a chest-mounted harness, a regular smartphone and bone conduction headphones, our system is wearable and portable. In addition, it is also non-intrusive, satisfying the hands-free and ears-free requirements imposed by VI users. Because the entire processing is performed on the smartphone, no connection to an external processing unit is required.

For further work, we propose integrating our system into a more comprehensive assistant which includes navigation information, stairs and crossings detectors, and a people recognizer (to help identify familiar persons). The use of portable stereoscopic acquisition devices, which can serve real-time 2D/3D reconstruction purposes and notably provide more precise information related to the distances to the detected objects/obstacles, is also an interesting axis of future development.

With the emergence of graphical boards integrated on regular smartphones (e.g. NVIDIA TX1) we envisage the integration of deep learning strategies within the object detection and classification processes.

Acknowledgments

This work has been partially supported by the AAL (Ambient Assisted Living) ALICE project (AAL-2011-4-099), co-financed by ANR (Agence Nationale de la Recherche) and CNSA (Conseil National pour la Solidarité et l’Autonomie).

This work was supported by a grant of the Romanian National Authority for Scientific Research and Innovation, CNCS - UEFISCDI, project number PN-II-RU-TE-2014-4-0202.

References

1. Alahi A, Ortiz R, Vandergheynst P (2012) FREAK: fast retina keypoint. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
2. Ali H, Paar G, Paletta L (2007) Semantic indexing for visual recognition of buildings. In: 5th International Symposium on Mobile Mapping Technology, pp 6–9
3. Arthur D, Vassilvitskii S (2007) K-means++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp 1027–1035. http://dl.acm.org/citation.cfm?id=1283383.1283494
4. Baatz G, Köser K, Chen D, Grzeszczuk R, Pollefeys M (2010) Handling urban location recognition as a 2D homothetic problem. In: Daniilidis K, Maragos P, Paragios N (eds) Computer Vision – ECCV 2010, Springer Berlin Heidelberg, pp 266–279. doi:10.1007/978-3-642-15567-3_20
5. Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-Up Robust Features (SURF). Comput Vis Image Underst 110:346–359. doi:10.1016/j.cviu.2007.09.014
6. Black M, Anandan P (1993) A framework for robust estimation of optical flow. In: International Conference on Computer Vision, pp 231–236
7. Blasch BB, Wiener WR, Welsh RL (1997) Foundations of orientation and mobility, 2nd edn. American Foundation for the Blind Press, New York
8. Brock M, Kristensson PO (2013) Supporting blind navigation using depth sensing and sonification. In: Proceedings of the ACM Conference on Pervasive and Ubiquitous Computing, Switzerland
9. Chandrasekhar VR, Chen DM, Tsai SS, Cheung NM, Chen H, Takacs G et al (2011) The Stanford mobile visual search data set. In: Proceedings of the Second Annual ACM Conference on Multimedia Systems, ACM, New York, NY, USA, pp 117–122. doi:10.1145/1943552.1943568
10. Chaudhry S, Chandra R (2015) Design of a mobile face recognition system for visually impaired persons. CoRR abs/1502.00756
11. Chen DM, Baatz G, Köser K, Tsai SS, Vedantham R, Pylvanainen T et al (2011) City-scale landmark identification on mobile devices. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 737–744. doi:10.1109/CVPR.2011.5995610
12. Chen L, Guo B, Sun W (2010) Obstacle detection system for visually impaired people based on stereo vision. In: Proceedings of the 4th International Conference on Genetic and Evolutionary Computing, Shenzhen, China, pp 13–15
13. Csurka G, Bray C, Dance C, Fan L (2004) Visual categorization with bags of keypoints. In: Workshop on Statistical Learning in Computer Vision, ECCV, pp 1–22
14. Dakopoulos D, Boddhu SK, Bourbakis N (2007) A 2D vibration array as an assistive device for visually impaired. In: Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering (BIBE 2007), pp 930–937. doi:10.1109/BIBE.2007.4375670
15. Dakopoulos D, Bourbakis N (2008) Preserving visual information in low resolution images during navigation of visually impaired. In: Proceedings of the 1st International Conference on PErvasive Technologies Related to Assistive Environments, ACM, New York, NY, USA, pp 27:1–27:6. doi:10.1145/1389586.1389619
16. Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol 1, pp 886–893. doi:10.1109/CVPR.2005.177
17. Dalal N, Triggs B (2006) Object detection using histograms of oriented gradients. In: European Conference on Computer Vision
18. Delhumeau J, Gosselin P-H, Jégou H, Pérez P (2013) Revisiting the VLAD image representation. In: ACM Multimedia, pp 653–656
19. Ding C, He X (2004) K-means clustering via principal component analysis. In: Proceedings of the Twenty-First International Conference on Machine Learning, ACM, New York, NY, USA, p 29. doi:10.1145/1015330.1015408
20. El Mobacher A, Mitri N, Awad M (2013) Entropy-based and weighted selective SIFT clustering as an energy aware framework for supervised visual recognition of man-made structures. Math Probl Eng
21. Erhan D, Szegedy C, Toshev A, Anguelov D (2014) Scalable object detection using deep neural networks. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 2155–2162
22. Everingham M, Van Gool L, Williams CK, Winn J, Zisserman A (2010) The Pascal Visual Object Classes (VOC) challenge. Int J Comput Vis 88:303–338. doi:10.1007/s11263-009-0275-4
23. Farabet C, Couprie C, Najman L, LeCun Y (2013) Learning hierarchical features for scene labeling. IEEE Trans Pattern Anal Mach Intell, pp 1–15
24. Fernando B, Fromont E, Muselet D, Sebban M (2012) Discriminative feature fusion for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 3434–3441
25. Gauglitz S, Hollerer T, Turk M (2011) Evaluation of interest point detectors and feature descriptors for visual tracking. Int J Comput Vis, pp 1–26
26. Girshick RB, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition, pp 580–587
27. Girshick R, Donahue J, Darrell T, Malik J (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 580–587. doi:10.1109/CVPR.2014.81
28. Golledge RG, Marston JR, Costanzo CM (1997) Attitudes of visually impaired persons towards the use of public transportation. J Vis Impair Blindness 90:446–459
29. Grauman K, Leibe B (2011) Visual object recognition. Morgan & Claypool, San Francisco
30. Gronat P, Obozinski G, Sivic J, Pajdla T (2013) Learning and calibrating per-location classifiers for visual place recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp 907–914. doi:10.1109/CVPR.2013.122
31. Harris C, Stephens M (1988) A combined corner and edge detector. In: Alvey Vision Conference, pp 147–151
32. Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans Pattern Anal Mach Intell 33(1):117–128
33. Johnson LA, Higgins CM (2006) A navigation aid for the blind using tactile-visual sensory substitution. In: 28th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBS'06), pp 6289–6292. doi:10.1109/IEMBS.2006.259473
34. José J, Farrajota M, Rodrigues JMF, du Buf JMH (2011) The smart vision local navigation aid for blind and visually impaired persons. Int J Digit Content Technol Appl 5:362–375
35. Khan A, Moideen F, Lopez J, Khoo WL, Zhu Z (2012) KinDetect: kinect detection objects. In: Computers Helping People with Special Needs, LNCS 7382, pp 588–595
36. Kuo BC, Ho HH, Li CH, Hung CC, Taur JS (2014) A kernel-based feature selection method for SVM with RBF kernel for hyperspectral image classification. IEEE J Sel Top Appl Earth Obs Remote Sens 7:317–326. doi:10.1109/JSTARS.2013.2262926
37. Lampert CH, Blaschko MB, Hofmann T (2008) Beyond sliding windows: object localization by efficient subwindow search. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2008), pp 1–8. doi:10.1109/CVPR.2008.4587586
38. Lee JJ, Kim G (2007) Robust estimation of camera homography using fuzzy RANSAC. In: Proceedings of the 2007 International Conference on Computational Science and Its Applications - Volume Part I, Springer-Verlag, Berlin, Heidelberg, pp 992–1002. http://dl.acm.org/citation.cfm?id=1802834.1802930
39. Calonder M, Lepetit V, Strecha C, Fua P (2010) BRIEF: binary robust independent elementary features. In: 11th European Conference on Computer Vision (ECCV), Heraklion, Crete, LNCS, Springer
40. Leutenegger S, Chli M, Siegwart R (2011) BRISK: binary robust invariant scalable keypoints. In: IEEE International Conference on Computer Vision (ICCV)
41. Li J, Allinson NM (2009) Dimensionality reduction-based building recognition. In: 9th IASTED International Conference on Visualization
42. Lin Q, Hahn HS, Han YJ (2013) Top-view based guidance for blind people using directional ellipse model. Int J Adv Robot Syst 1:1–10
43. Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol 2, pp 1150–1157. doi:10.1109/ICCV.1999.790410
44. Lowe D (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60:91–110. doi:10.1023/B:VISI.0000029664.99615.94
45. Lucas B, Kanade T (1981) An iterative image registration technique with an application to stereo vision. In: Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI'81), vol 2, pp 674–679
46. Manduchi R (2012) Mobile vision as assistive technology for the blind: an experimental study. In: Proceedings of the 13th International Conference on Computers Helping People with Special Needs - Volume Part II, Springer-Verlag, Berlin, Heidelberg, pp 9–16. doi:10.1007/978-3-642-31534-3_2
47. Matas J, Chum O, Urban M, Pajdla T (2004) Robust wide-baseline stereo from maximally stable extremal regions. Image Vis Comput 22:761–767. doi:10.1016/j.imavis.2004.02.006
48. Meers S, Ward K (2005) A substitute vision system for providing 3D perception and GPS navigation via electro-tactile stimulation. In: 1st International Conference on Sensing Technology, pp 21–23
49. Muja M, Lowe DG (2009) Fast approximate nearest neighbors with automatic algorithm configuration. In: VISAPP International Conference on Computer Vision Theory and Applications, pp 331–340
50. Oneata D, Revaud J, Verbeek J, Schmid C (2014) Spatio-temporal object detection proposals. In: European Conference on Computer Vision (ECCV 2014), LNCS 8691, Springer, Zurich, Switzerland, pp 737–752
51. Pascolini D, Mariotti SP (2012) Global data on visual impairments 2010. World Health Organization, Geneva
52. Peng E, Peursum P, Li L, Venkatesh S (2010) A smartphone-based obstacle sensor for the visually impaired. In: Yu Z, Liscano R, Chen G, Zhang D, Zhou X (eds) Ubiquitous Intelligence and Computing, Springer Berlin Heidelberg, pp 590–604. doi:10.1007/978-3-642-16355-5_45
53. Powers DMW (2011) Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. J Mach Learn Technol 2(1):37–63
54. Pradeep V, Medioni G, Weiland J (2010) Robot vision for the visually impaired. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp 15–22. doi:10.1109/CVPRW.2010.5543579
55. Rister B, Wang G, Wu M, Cavallaro JR (2013) A fast and efficient SIFT detector using the mobile GPU. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp 2674–2678. doi:10.1109/ICASSP.2013.6638141
56. Rodríguez A, Yebes JJ, Alcantarilla PF, Bergasa LM, Almazán J, Cela A (2012) Assisting the visually impaired: obstacle detection and warning system by acoustic feedback. Sensors 12:17476–17496. doi:10.3390/s121217476
57. Rosa S, Paleari M, Ariano P, Bona B (2012) Object tracking with adaptive HOG detector and adaptive Rao-Blackwellised particle filter. In: Proceedings of SPIE 8301, Intelligent Robots and Computer Vision XXIX: Algorithms and Techniques, 83010W. doi:10.1117/12.911991
58. Rublee E, Rabaud V, Konolige K, Bradski G (2011) ORB: an efficient alternative to SIFT or SURF. In: IEEE International Conference on Computer Vision (ICCV), pp 2564–2571
59. Saez JM, Escolano F (2008) Stereo-based aerial obstacle detection for the visually impaired. In: Workshop on Computer Vision Applications for the Visually Impaired, Marseille, France
60. Saez JM, Escolano F, Penalver A (2005) First steps towards stereo-based 6DOF SLAM for the visually impaired. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops (CVPR Workshops), p 23. doi:10.1109/CVPR.2005.461
61. Sainarayanan G, Nagarajan R, Yaacob S (2007) Fuzzy image processing scheme for autonomous navigation of human blind. Appl Soft Comput 7:257–264
62. Shao H, Svoboda T, Tuytelaars T, Van Gool L (2003) HPAT indexing for fast object/scene recognition based on local appearance. In: Bakker E, Lew M, Huang T, Sebe N, Zhou X (eds) Image and Video Retrieval, Springer Berlin Heidelberg, pp 71–80. doi:10.1007/3-540-45113-7_8
63. Szegedy C, Toshev A, Erhan D (2013) Deep neural networks for object detection. In: Annual Conference on Neural Information Processing Systems, pp 2553–2561
64. Takizawa H, Yamaguchi S, Aoyagi M, Ezaki N, Mizuno S (2012) Kinect cane: an assistive system for the visually impaired based on three-dimensional object recognition. In: Proceedings of the IEEE International Symposium on System Integration, Japan
65. Tian Y, Yang X, Arditi A (2010) Computer vision-based door detection for accessibility of unfamiliar environments to blind persons. In: Proceedings of the 12th International Conference on Computers Helping People with Special Needs, Springer LNCS, vol 6180, pp 263–270
66. Tong S, Chang E (2001) Support vector machine active learning for image retrieval. In: Proceedings of the Ninth ACM International Conference on Multimedia, ACM, New York, NY, USA, pp 107–118. doi:10.1145/500141.500159
67. Tuzel O, Porikli F, Meer P (2006) Region covariance: a fast descriptor for detection and classification. In: ECCV, LNCS 3952, pp 589–600
68. van de Sande KEA, Uijlings JRR, Gevers T, Smeulders AWM (2011) Segmentation as selective search for object recognition. In: Proceedings of the 2011 International Conference on Computer Vision, IEEE Computer Society, Washington, DC, USA, pp 1879–1886. doi:10.1109/ICCV.2011.6126456
69. Vinyals O, Toshev A, Bengio S, Erhan D (2015) Show and tell: a neural image caption generator. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
70. Wang HC et al (2015) Bridging text spotting and SLAM with junction features. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Hamburg, pp 3701–3708
71. Yu JH, Chung HI, Hahn HS (2009) Walking assistance system for sight impaired people based on a multimodal transformation technique. In: Proceedings of the ICROS-SICE International Joint Conference, Japan
72. Zhang W (2005) Localization based on building recognition. In: IEEE Workshop on Applications for Visually Impaired, pp 21–28
73. Zhang M, Zhou Z (2005) A k-nearest neighbor based algorithm for multilabel classification. In: IEEE International Conference on Granular Computing, vol 2, pp 718–721
74. Zhao C, Liu C, Lai Z (2011) Multi-scale gist feature manifold for building recognition. Neurocomputing 74:2929–2940. doi:10.1016/j.neucom.2011.03.035

Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

1. ARTEMIS Department, Institut Mines-Télécom / Télécom SudParis, UMR CNRS MAP5 8145, Évry, France
2. Telecommunication Department, Faculty of ETTI, University "Politehnica" of Bucharest, Bucharest, Romania
