
1 Introduction

The problem of humanoid robot navigation based only on monocular vision has recently attracted much interest. Much research on this problem has been reported in the context of wheeled mobile robots. In particular, the visual memory approach [1] has been widely studied. It mimics the human behavior of remembering key visual information when moving through unknown environments, in order to make future navigation easier.

Robot navigation based on a visual memory consists of two stages [1]. First, in a learning stage, the robot creates a representation of an unknown environment by means of a set of key images that forms the so-called visual memory. Then, in an autonomous navigation stage, the robot has to reach the location associated with a desired key image by following a visual path. That path is defined by a subset of images of the visual memory that topologically connects the key image most similar to the current view of the robot with the target image.

Little work has been done on humanoid navigation based on a visual memory [2, 3]. In both works, the robot is not initially kidnapped: it starts the navigation from a known position. In this context, the main interest of this paper is to solve an appearance-based localization problem, where the current image is matched to a known location purely by comparing images [4]. In particular, we address the localization of humanoid robots using only monocular vision.

This paper addresses the problem of determining the key image in a visual memory that is the most similar in appearance to the current view of the robot (input image). Figure 1 presents a general diagram of the problem. Consider that the visual memory consists of n ordered key images (\(\mathcal {I}_{1}^{*},\mathcal {I}_{2}^{*},...,\mathcal {I}_{n}^{*}\)). The robot is initially kidnapped; the current view \(\mathcal {I}\) has to be compared with the n key images, and the method must output the most similar key image \(\mathcal {I}_{o}^{*}\) within the visual memory.

Since a naive comparison might take too much time depending on the size of the visual memory, we propose to take advantage of a method that compresses the visual memory into a compact, efficient-to-access representation: the visual bag of words (VBoW) [5]. A bag of words is a structure that represents an image as a numerical vector, allowing fast image comparisons. In robotics, the VBoW approach has been used in particular for loop closure in simultaneous localization and mapping (SLAM) [6, 7], where re-visited places have to be recognized.

Fig. 1. General diagram of the appearance-based localization from a visual memory.

Fig. 2. Example of images from an onboard camera of a humanoid NAO robot.

In this paper, a quantitative evaluation of the VBoW approach using different local descriptors is carried out. In particular, we evaluate the approach on real datasets captured by a camera mounted on the head of a small-size humanoid robot. The images are affected by issues related to the sway motion introduced by the humanoid locomotion: blurring and rotation around the optical axis. A specific visual vocabulary is proposed to tackle those issues. Figure 2 shows two examples of images captured from our experimental platform, a NAO humanoid robot. These images are 640 \(\times \) 480 pixels.

The paper is organized as follows. Section 2 introduces the local descriptors included in our evaluation. Section 3 details the VBoW approach as implemented. Section 4 presents the results of the experimental evaluation and Sect. 5 gives some conclusions.

2 Local Descriptors

Local features describe regions of interest of an image through descriptor vectors. In the context of image comparison, groups of local features are more robust against occlusions and changes in viewpoint than global methods. From the existing local detectors/descriptors, we wish to select the best option for the specific task of appearance-based humanoid localization. Hereafter, we introduce the local descriptors selected for a comparative evaluation.

2.1 Real-Valued Descriptors

A popular keypoint detector/descriptor is SURF (Speeded Up Robust Features) [8]. It has good invariance properties with respect to scale and rotation, and SURF keypoints can be computed and compared much faster than those of earlier alternatives. Thus, we selected SURF as the real-valued descriptor to be compared in our localization framework. The detection is based on the Hessian matrix and uses integral images to reduce the computation time. The descriptor combines Haar-wavelet responses within the interest point neighborhood and also exploits integral images to increase speed. In our evaluation, the standard implementation of SURF (vector of dimension 64) included in the OpenCV library is used.
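For reference, a minimal sketch of this SURF extraction step using the OpenCV Python bindings is given below (SURF lives in the contrib/non-free xfeatures2d module; the image file name and Hessian threshold are placeholders, not values from the paper):

```python
# Minimal sketch of SURF extraction with OpenCV (requires opencv-contrib,
# since SURF is in the non-free xfeatures2d module). File name is a placeholder.
import cv2

img = cv2.imread("key_image.png", cv2.IMREAD_GRAYSCALE)

# Default (non-extended) SURF yields 64-dimensional real-valued descriptors,
# matching the standard configuration used in the evaluation.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(img, None)
# descriptors: N x 64 float array, one row per detected keypoint
print(len(keypoints), descriptors.shape)
```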

2.2 Binary Descriptors

Binary descriptors represent image features by binary strings instead of floating-point vectors. The extracted information is thus very compact, occupies less memory and can be compared faster. Two popular binary descriptors have been selected for our evaluation: Binary Robust Independent Elementary Features (BRIEF [9]) and Oriented FAST and Rotated BRIEF (ORB [10]). Both use variants of FAST (Features from Accelerated Segment Tests) [11], i.e., they detect keypoints by comparing the gray levels along a circle of radius 3 to the gray level of the circle center. On average, most pixels can be discarded quickly, hence the detection is fast. BRIEF uses the standard FAST keypoints, while ORB uses oFAST keypoints, an improved version of FAST that includes an orientation component. The BRIEF descriptor is a binary vector of user-chosen length, where each bit results from an intensity comparison between a pair of pixels within a patch around the keypoint. The patches are previously smoothed with a Gaussian kernel to reduce noise. BRIEF descriptors encode no rotation or scale information, so they are hardly invariant to either. This issue can be overcome by the rotation-aware BRIEF descriptor (ORB), which computes a dominant orientation between the center of the keypoint and the intensity centroid of its patch. The BRIEF comparison pattern is rotated accordingly, so that the descriptor should not vary when the image is rotated in the plane. In our evaluation, we use oFAST keypoints given by the ORB detection method as implemented in OpenCV, along with BRIEF with a patch size of 48 and a descriptor length of 256 bits. The ORB implementation is the one of OpenCV, with descriptors of 256 bits.
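The following sketch shows how both binary descriptors can be obtained with OpenCV under a configuration similar to the one described above (BRIEF is in the contrib xfeatures2d module and its length is given in bytes, so 32 bytes correspond to 256 bits; the file name is a placeholder):

```python
# Minimal sketch of ORB and BRIEF extraction with OpenCV's Python bindings.
import cv2

img = cv2.imread("key_image.png", cv2.IMREAD_GRAYSCALE)

# ORB: oFAST keypoints + rotation-aware BRIEF, 256-bit descriptors by default
orb = cv2.ORB_create(nfeatures=500)
kps_orb, desc_orb = orb.detectAndCompute(img, None)   # desc_orb: N x 32 uint8

# Plain BRIEF on standard FAST keypoints (no orientation handling)
fast = cv2.FastFeatureDetector_create()
brief = cv2.xfeatures2d.BriefDescriptorExtractor_create(32)  # 32 bytes = 256 bits
kps_fast = fast.detect(img, None)
kps_fast, desc_brief = brief.compute(img, kps_fast)   # desc_brief: M x 32 uint8

# Binary descriptors of this kind are compared with the Hamming distance.
```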

2.3 Color Descriptors

We also evaluate the image comparison approach using only color information. To do so, we use rectangular patches, and a color histogram is associated to each patch as a descriptor. We select the HSL (Hue-Saturation-Lightness) color space because its three components are more natural to interpret and less correlated than in other color spaces. Moreover, only the H and S channels are used, in order to achieve robustness against illumination changes. The color descriptor of each rectangular patch is thus a two-dimensional histogram of hue and saturation, whose size was set experimentally to 64 bins. Three different alternatives are evaluated using color descriptors (a sketch of this patch descriptor is given after the list below):

  • Random patches: 500 patches of size 48 \(\times \) 64, randomly selected. This option is referred to as Color-Random.

  • Uniform grid: A uniform grid of 19 \(\times \) 19 patches covering the image, with patches overlapping by half their size. This option is referred to as Color-Whole.

  • Uniform grid on half of the image: Instead of using the whole image, only the upper half is used. This is because the lower parts of the images, when taken by the humanoid robot, are mainly projections of the floor and do not discriminate well for localization purposes. This option is referred to as Color-Half.
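As announced above, here is a minimal sketch of the patch color-histogram descriptor, assuming an 8 \(\times \) 8 hue-saturation binning (64 bins) and OpenCV's HLS conversion as the HSL color space; the patch orientation (48 wide, 64 high) and the random-patch parameters of the Color-Random variant are illustrative assumptions:

```python
# Minimal sketch of the patch hue-saturation histogram descriptor.
import cv2
import numpy as np

def patch_hs_histogram(image_bgr, x, y, w=48, h=64, bins=(8, 8)):
    """L1-normalized 2D hue-saturation histogram of the patch at (x, y)."""
    patch = image_bgr[y:y + h, x:x + w]
    hls = cv2.cvtColor(patch, cv2.COLOR_BGR2HLS)
    # channels 0 (hue) and 2 (saturation); lightness (channel 1) is dropped
    hist = cv2.calcHist([hls], [0, 2], None, list(bins), [0, 180, 0, 256])
    return (hist / max(hist.sum(), 1e-9)).flatten()   # 64-bin descriptor

img = cv2.imread("key_image.png")          # placeholder file name
H, W = img.shape[:2]

# Color-Random variant: 500 randomly placed 48 x 64 patches
rng = np.random.default_rng(0)
descriptors = [patch_hs_histogram(img,
                                  rng.integers(0, W - 48),
                                  rng.integers(0, H - 64))
               for _ in range(500)]
```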

3 Visual Bag of Words for Humanoid Localization

As mentioned above, this work relies on the hierarchical visual bag of words approach [12] to combine the high descriptive power of local descriptors with the versatility and robustness of histograms. In Sect. 3.1, we recall the main characteristics of [12], and in Sect. 3.2, we introduce a novel use of the BRIEF descriptor suited to a VBoW approach in the context of humanoid robots.

3.1 Hierarchical Visual Bag of Words Approach

The visual bag of words approach first discretizes the local descriptor space into a series of words, i.e., clusters in the local descriptor space. Here, we follow the strategy of Nistér et al. [12], who perform this step in a hierarchical way: in the set of n key images \(\mathcal {I}_{1}^{*},\mathcal {I}_{2}^{*},...,\mathcal {I}_{n}^{*}\) forming the visual memory, a pool of D local descriptors is detected, as illustrated in Fig. 3, left. The local descriptors can be extracted by any of the methods mentioned before. Given a branch factor k (a small integer, typically 3 or 4), the idea is to form k clusters among the D descriptors by using the kmeans++ algorithm. Then, the sets of descriptors associated to these k clusters are recursively clustered into k further clusters, and so on, up to a maximum depth of L levels. At each level, each formed cluster is associated to a representative descriptor chosen by the kmeans++ algorithm, which will be compared with new descriptors. The leaves of this tree of recursively refined clusters correspond to the visual words, i.e., the clusters in the local descriptor space. The advantage is that, when faced with descriptors found in new images, it is computationally efficient to associate them to a visual word, namely with kL distance computations, i.e., k at each level. Since the tree has \(W=k^L\) leaves (i.e., words), characterizing a descriptor as a word is done in \(O(k \log _k W)\) operations, instead of the W computations required by a naive approach. This principle is illustrated in Fig. 4.
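A minimal sketch of this hierarchical clustering and of the word lookup is given below, assuming real-valued descriptors stacked in a D \(\times \) dim NumPy array and scikit-learn's KMeans (which uses k-means++ initialization by default); binary descriptors would require a Hamming-based clustering instead, and the default k and depth values are only illustrative:

```python
# Minimal sketch of a hierarchical (vocabulary-tree) clustering of descriptors.
import numpy as np
from sklearn.cluster import KMeans

def build_tree(descriptors, k=3, depth=4):
    """Recursively cluster an (N, dim) array into k branches up to `depth` levels.
    Each node stores a representative descriptor (cluster center) and its children;
    the leaves play the role of the visual words."""
    if depth == 0 or len(descriptors) < k:
        return {"center": descriptors.mean(axis=0), "children": []}
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(descriptors)
    children = []
    for j in range(k):
        child = build_tree(descriptors[km.labels_ == j], k, depth - 1)
        child["center"] = km.cluster_centers_[j]   # representative at this level
        children.append(child)
    return {"center": descriptors.mean(axis=0), "children": children}

def lookup_word(tree, desc, path=()):
    """Descend the tree with k distance computations per level, O(k log_k W)."""
    if not tree["children"]:
        return path                                # leaf path identifies the word
    dists = [np.linalg.norm(desc - c["center"]) for c in tree["children"]]
    j = int(np.argmin(dists))
    return lookup_word(tree["children"][j], desc, path + (j,))

# Usage: tree = build_tree(all_descriptors, k=3, depth=4); word = lookup_word(tree, d)
```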

Fig. 3. Representation of an image in visual words.

Fig. 4. Hierarchical visual word search: when a new descriptor is found in some image \(\mathcal I\), it is recursively compared to the representatives of each cluster.

When handling a new image \(\mathcal {I}\), d descriptors are extracted, and each of them is associated to a visual word as explained above. This way, we obtain an empirical distribution of the visual words in \(\mathcal {I}\), in the form of a histogram of visual words \(v(\mathcal I)\) (see Fig. 3, right). The content of \(\mathcal {I}\) can now be compared with that of any key image \(\mathcal {I}_{i}^{*}\) by comparing their histograms. Of course, because n may be very large, it is out of the question to compare the histogram of \(\mathcal {I}\) with the n histograms of the key images. That is why an important element of this representation is the notion of inverse dictionary: for each visual word, one stores the list of key images containing this word. Then, for a new image, we can easily determine, for each visual word it contains, the list of key images also containing this word. To limit the comparisons, we restrict the search for the most similar images to the subset of key images having at least 5 visual words in common with \(\mathcal {I}\). For an image \(\mathcal I\), each histogram entry \(v_i\) (where i refers to the visual word) is defined as:

$$ v_i(\mathcal I) = \frac{c_i(\mathcal I)}{c(\mathcal I)}\log (\frac{n}{n_i}) $$

where \(c(\mathcal I)\) is the total number of descriptors present in \(\mathcal I\), \(c_i(\mathcal I)\) the number of descriptors of \(\mathcal I\) classified as word i, and \(n_i\) the number of key images where word i has been found. The log term weights the frequency of word i in \(\mathcal I\) according to its overall presence: when a word is present everywhere in the database, the information of its presence is not very pertinent.
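A minimal sketch of the inverse dictionary, of the candidate selection and of this tf-idf weighted histogram is given below; word ids are assumed to be integers in [0, W), and all function and variable names are placeholders:

```python
# Minimal sketch of the inverse dictionary and tf-idf weighted word histogram.
import numpy as np
from collections import defaultdict

def build_inverted_index(key_image_words):
    """key_image_words: one list of word ids per key image."""
    inverted = defaultdict(set)
    for img_idx, words in enumerate(key_image_words):
        for w in words:
            inverted[w].add(img_idx)      # word -> key images containing it
    return inverted

def tfidf_histogram(words, inverted, n_key_images, n_words):
    """v_i = (c_i / c) * log(n / n_i) for the word ids found in one image."""
    v = np.zeros(n_words)
    counts = np.bincount(words, minlength=n_words)
    c = max(counts.sum(), 1)
    for i in np.nonzero(counts)[0]:
        n_i = len(inverted.get(i, ()))
        if n_i > 0:
            v[i] = (counts[i] / c) * np.log(n_key_images / n_i)
    return v

def candidate_images(words, inverted, min_common=5):
    """Key images sharing at least `min_common` visual words with the query."""
    votes = defaultdict(int)
    for w in set(words):
        for img_idx in inverted.get(w, ()):
            votes[img_idx] += 1
    return [idx for idx, n in votes.items() if n >= min_common]
```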

Last, we must choose how to compare histograms. After extensive comparisons among the most popular histogram metrics, we chose the \(\chi ^2\) distance, which compares two histograms v and w through:

$$ \chi ^2(v,w) = \sum _{i=1}^{W} \frac{(\hat{v}_i-\hat{w}_i)^2}{\hat{v}_i+\hat{w}_i}, $$

where \(\hat{v}=\frac{1}{\Vert v\Vert _1}v\). Figure 5 sums up the whole methodology.
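A direct implementation of the \(\chi ^2\) comparison on L1-normalized histograms could look as follows (a small epsilon, our own addition, guards the bins where both entries vanish):

```python
# Minimal sketch of the chi-squared distance between two word histograms.
import numpy as np

def chi2_distance(v, w, eps=1e-10):
    v_hat = v / max(np.abs(v).sum(), eps)   # L1 normalization
    w_hat = w / max(np.abs(w).sum(), eps)
    return np.sum((v_hat - w_hat) ** 2 / (v_hat + w_hat + eps))
```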

Fig. 5. Complete method of image comparison based on visual bag of words.

3.2 A BRIEF-based Vocabulary for Humanoids Localization

We introduce a novel use of the BRIEF descriptor suited to a VBoW approach in the context of humanoid robots. It is a specific vocabulary that we call BRIEFROT, which deals with the issues generated by the humanoid locomotion. BRIEFROT possesses three independent internal vocabularies, two of which are rotated by a fixed angle: one anti-clockwise, the other clockwise. Through experimentation, we found that for the NAO humanoid platform a suitable value for the rotation of the vocabularies is 10 degrees. These rotated vocabularies are meant to compensate for the slight rotations caused by the locomotion of these robotic systems: they represent the images of the visual memory as if they were rotated. The third vocabulary is identical to the standard BRIEF one. The idea of using three vocabularies is that if the input image is rotated with respect to the images of the visual memory, then it is captured by one of the rotated vocabularies; if it is not rotated, it is captured by the vocabulary without rotation. Additionally, the detected local patches are smoothed with a Gaussian kernel to reduce the blur effect.
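As a purely conceptual sketch (not the authors' implementation), the three descriptor pools behind such a scheme could be prepared as follows: one pool from the key images as they are, and two from copies rotated by +10 and −10 degrees around the image center; each pool would then be clustered into its own vocabulary tree. The whole-image Gaussian smoothing below is a stand-in for the patch smoothing mentioned above, and the file names are placeholders:

```python
# Conceptual sketch of preparing three BRIEF descriptor pools (0, +10, -10 degrees).
import cv2

def rotate(img, angle_deg):
    h, w = img.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), angle_deg, 1.0)
    return cv2.warpAffine(img, M, (w, h))

fast = cv2.FastFeatureDetector_create()
brief = cv2.xfeatures2d.BriefDescriptorExtractor_create(32)  # 256-bit BRIEF

def brief_descriptors(img_gray):
    # whole-image Gaussian smoothing as a stand-in for the per-patch smoothing
    smoothed = cv2.GaussianBlur(img_gray, (5, 5), 0)
    kps = fast.detect(smoothed, None)
    _, desc = brief.compute(smoothed, kps)
    return desc

key_images = [cv2.imread(p, cv2.IMREAD_GRAYSCALE)
              for p in ["key_0.png"]]             # placeholder key-image paths
pools = {angle: [brief_descriptors(rotate(im, angle)) for im in key_images]
         for angle in (-10, 0, 10)}               # 0 = the plain BRIEF vocabulary
```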

4 Experimental Evaluation

We evaluated the local descriptors mentioned in Sect. 2 on four datasets: three in indoor environments (CIMAT-NAO-A, CIMAT-NAO-B and Bicocca) and one outdoors (New College). The tests were run on a laptop with Ubuntu 12.04, 4 GB of RAM and a 1.30 GHz processor.

4.1 Description of the Evaluation Datasets

The CIMAT-NAO-A dataset was acquired with a NAO humanoid robot inside CIMAT. It contains 640 \(\times \) 480 images of good quality but also blurry ones; some images are affected by rotations introduced by the humanoid locomotion or by lighting changes. We used 187 hand-selected images as a visual memory and 258 images for testing. The CIMAT-NAO-B dataset was also captured indoors at CIMAT with the humanoid robot. It also contains good-quality and blurry 640 \(\times \) 480 images, but no images with drastic lighting changes as in the previous dataset. We used 94 images as a visual memory and 94 images for testing. Both CIMAT-NAO-A and CIMAT-NAO-B are available at http://personal.cimat.mx:8181/~hmbecerra/CimatDatasets.zip.

The Bicocca 2009-02-25b dataset is available online [13] and was acquired by a wheeled robot inside a university building. Its 320 \(\times \) 240 images exhibit neither rotation around the optical axis nor blur. We used 120 images as a visual memory and 120 images for testing. Unlike the three previous datasets, which were obtained indoors, the New College dataset was acquired outdoors at the University of Oxford by a wheeled robot [14], with important lighting changes. Its 384 \(\times \) 512 images are of good quality, with no rotation or blur. For this dataset, 122 images were chosen as a visual memory and 117 images for testing.

4.2 Evaluation Metrics

Since the goal of this work is to evaluate different descriptors within a VBoW approach, it is critical to define metrics that assess the quality of the retrieval in our application. We propose two metrics; the first one is:

$$ \mu _1(\mathcal {I}) = {{\mathrm{rank}}}(\bar{k}(\mathcal {I})) $$

where \(\bar{k}(\mathcal {I})\) is the ground-truth index of the key image associated to \(\mathcal {I}\). In the best case, the rank of the closest image should be one, so \(\mu _1(\mathcal {I})=1\) means that the retrieval is perfect, whereas higher values correspond to worse retrievals. The second metric is:

$$ \mu _2(\mathcal {I}) = \sum \limits _l \frac{z_l(\bar{k}(\mathcal {I}))}{\sum \limits _{l'} z_{l'}(\bar{k}(\mathcal {I}))} {{\mathrm{rank}}}(l) $$

where \(z_l(k)\) is the similarity score between the key images k and l inside the visual memory. This metric is proposed to handle similar key images within the dataset; hence, the final score integrates weights (normalized by \(\sum _{l'} z_{l'}(\bar{k}(\mathcal {I}))\) to sum to one) from the key images l similar to the ground-truth closest image \(\bar{k}(\mathcal {I})\). This ensures that a retrieval is considered good whenever all the key images close to the ground truth are well ranked.
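A minimal sketch of both metrics is given below, assuming that scores holds the similarity of the query against each key image (higher is more similar), z is the key-to-key similarity matrix and k_gt is the ground-truth index; these names are placeholders:

```python
# Minimal sketch of the two evaluation metrics mu_1 and mu_2.
import numpy as np

def rank_of(scores, index):
    """1-based rank of `index` when key images are sorted by decreasing score."""
    order = np.argsort(-np.asarray(scores))
    return int(np.where(order == index)[0][0]) + 1

def mu1(scores, k_gt):
    return rank_of(scores, k_gt)

def mu2(scores, z, k_gt):
    weights = z[k_gt] / z[k_gt].sum()   # similarity of each key image to k_gt
    return sum(weights[l] * rank_of(scores, l) for l in range(len(weights)))
```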

4.3 Parameters Selection

There are three free parameters: the number of clusters k, the tree depth L and the similarity measure. Tests were performed by varying k from \( k = 8\) to \( k = 10 \) and L from \( L = 4 \) to \( L = 8 \). Different similarity measures between histograms were also tested: L1 norm, L2 norm, \(\chi ^2\), Bhattacharyya and dot product. For these tests, we generated ground-truth data by manually selecting the most similar key image for each input image. The parameters were selected so that the confidence levels \(\mu _2\) were close to 1. We obtained the best results with \( k = 8 \), \( L = 8 \) and the \(\chi ^2\) metric. The CIMAT-NAO-A dataset was used for the parameter selection, since it is the most challenging one because of the type of images it contains.

4.4 Analysis of the Results Obtained on the Evaluation Datasets

The following tables present the results for the seven vocabularies created. On the one hand, the efficiency of the vocabularies is measured with the confidence level \(\mu _1\); in this case, the threshold for a test to be classified as correct was 1. On the other hand, for the confidence level \( \mu _2 \) we chose a threshold of 2.5: all tests below 2.5 were considered correct. This level of confidence takes into account the possible similarity between the images of the visual memory.

Table 1. Percentages of correct results for the dataset CIMAT-NAO-A.
Table 2. Percentages of correct results for the dataset CIMAT-NAO-B.
Table 3. Percentages of correct results for the dataset Bicocca25b.
Table 4. Percentages of correct results for the dataset New College.

Table 1 presents the results obtained for the CIMAT-NAO-A dataset. In this case, the BRIEFROT vocabulary shows the best behavior for both levels of confidence: for \(\mu _1\) it reaches an efficiency of 60.85\(\,\%\), and for \(\mu _2\) of 75.19\(\,\%\). The ORB vocabulary also offered good performance for \(\mu _2\) and was the second best for \(\mu _1\). The Color-Half vocabulary obtained the worst results. On the other hand, for the CIMAT-NAO-B dataset (Table 2), the SURF vocabulary behaved better than BRIEFROT, but with higher computation times; the reported times were measured from the feature extraction stage to the comparison stage. ORB again behaved well and was the second best vocabulary. The Color-Random vocabulary obtained the worst performance for \(\mu _1\), but for \(\mu _2\) it was one of the best vocabularies; this means that it tends to put the correct key image in the second rank. Color-Half had the worst results for \(\mu _2\).

In the Bicocca 2009-02-25b dataset (Table 3), three vocabularies obtained the best results for \(\mu _1\): BRIEFROT, Color-Random and SURF. The difference between these three vocabularies lies in the computation time: SURF consumes much more time. For \(\mu _2\), BRIEFROT was the best. In the New College dataset (Table 4), the SURF vocabulary obtained the best behavior for both levels of confidence. In both cases the BRIEFROT vocabulary obtained a good behavior, close to SURF, while consuming less than half the time required by SURF.

5 Conclusions

This paper addresses the problem of vision-based localization of humanoid robots, i.e., determining which image among a set of previously acquired images (the visual memory) is the most similar to the current robot view. To this end, we use a hierarchical visual bag of words (VBoW) approach. A comparative evaluation of the local descriptors used to feed the VBoW is reported: real-valued, binary and color descriptors were compared on real datasets captured by a small-size humanoid robot. We presented a novel use of the BRIEF descriptor suited to the VBoW approach for humanoid robots: BRIEFROT. According to our evaluation, the BRIEFROT vocabulary is very effective in this context, as reliable as SURF for solving the localization problem, but requiring much less time. We also showed that keypoint-based vocabularies performed better than color-based vocabularies.

As future work, we will explore the combination of visual vocabularies to robustify the localization results. We will implement the method onboard the NAO robot using a larger visual memory. We also wish to use the localization algorithm in the construction of the visual memory to identify revisited places.