1 Introduction

Assistive technology [13, 30] helps people with various disabilities to overcome problems of daily life [7]. The literature reports many attempts to support blind or visually impaired people using social applications [7], text readers [10], the Global Positioning System (GPS) [41], radio-frequency identification (RFID) tags [15, 27], radio beacons [37], QR codes [14], visual markers [20, 33], LED markers [26], ultra-wideband (UWB) technology [22], infrared (IR) cameras [18], or ultrasonic sensors [33]. Most of these approaches aim to provide an accurate position of a blind person by relying on a device attached to the location of interest. RFID tags are often used for this purpose, as can be seen in [15]. However, due to their short range, other types of tags are preferred. For example, Martinez-Sala et al. [22] introduced a UWB positioning technique that is used to obtain a path to the destination, taking into account obstacles, walkable areas, and places of interest. In [33], a wearable system was introduced in which images captured by an RGB camera were processed to find visual markers using Haar classifiers. In that work, an ultrasonic sensor was also used for detecting obstacles. In [18], an IR camera was used instead of an ultrasonic sensor for obstacle detection. Such a solution was able to provide a 3D map of the observed environment, making it superior to approaches relying on proximity sensors. RGB cameras provide more information about the environment, and thus they are applied in systems directed at visually impaired people. In one such application [20], large, pie-shaped colour markers were recognised using a mobile device with a camera. Tapu et al. used a bag of visual words with the HOG descriptor [34] for obstacle detection and category classification of observed objects. A computer vision approach to localisation, with image matching based on the Scale-Invariant Feature Transform (SIFT) [19], was proposed in [23]. Another descriptor, Speeded Up Robust Features (SURF) [5], was used in a banknote recognition system for the blind by Hasanuzzaman et al. [11].

In this paper, an approach to object recognition based on image matching is proposed. In order to provide higher recognition accuracy in a shorter time than can be achieved with popular floating-point and binary descriptors, a new binary descriptor is proposed. The descriptor performs binary tests between directional gradients of a small number of pixel blocks, which are placed on four scale-dependent image patches centred on the keypoint. The main novelty of the approach lies in the placement of pixel blocks within the patch, as well as in the arrangement of binary tests, which are performed between pixel blocks that belong to all selected patches. Since in this paper the location of a person is determined from the labels of recognised images, two demanding real-world datasets containing labelled indoor scenes are introduced.

The rest of the paper is organised as follows. In Sect. 2, local keypoint descriptors that are often used in image matching tasks are presented, together with the proposed binary descriptor. Section 3 covers the evaluation of the approach on typical image benchmarks. It also compares the approach with state-of-the-art descriptors in image recognition tasks, bearing in mind their possible use in vision-based assistive technology for partially sighted people. Section 4 concludes the paper.

2 Proposed method

Local feature descriptors are often used in vision-based object recognition [11, 21], retrieval [6], or scene categorisation [38]. However, there is still room for faster and more robust techniques that can successfully describe and match images despite various transformations, distortions, or illumination conditions [6, 12, 24, 25].

SIFT [19] and SURF [5] descriptors are among the most commonly used floating-point techniques. They are also interest point detectors, using extrema of the difference of Gaussians (SIFT) or the determinant of the Hessian (SURF). For keypoint description, SIFT uses a spatial histogram of the image gradients, while SURF introduced many approximations to this approach, using Haar wavelet responses determined for a scale-dependent window and integral images to speed up computations. Despite the high-quality description provided by SIFT [16], this technique, like SURF, suffers from long computation and matching times. Therefore, binary descriptors have been developed. Here, information carried by an image patch around the keypoint is transformed into a binary string using pairwise binary tests between image regions, pixel blocks, or raw pixels, according to a sampling pattern. Such binary strings can be compared using the Hamming distance, implemented as a fast bitwise XOR operation followed by a bit count. In Binary Robust Independent Elementary Features (BRIEF) [8], pairs of pixels are selected from a uniform distribution. In Oriented FAST and Rotated BRIEF (ORB) [32], in turn, a machine learning approach determined the sampling pattern for BRIEF features. Here, rotation invariance is achieved using the intensity centroid [32], and keypoints are determined with the FAST detector [31]. Another descriptor, Binary Robust Invariant Scalable Keypoints (BRISK) [17], uses AGAST [17] for interest point detection and incorporates a circular sampling pattern. A retinal sampling pattern is used in the Fast Retina Keypoint (FREAK) [2] technique. All these binary descriptors rely on intensity comparisons; therefore, bearing in mind the well-performing floating-point techniques that use image gradients, several new binary descriptors have been introduced. In Ordinal and Spatial information of Regional Invariants (OSRI) [39], binary tests on intensities and gradients of regional invariants are performed. However, OSRI suffers from the long computation time of its 21576-bit string, which additionally has to be reduced. In BinBoost [35], gradient-based image features are used for training an AdaBoost classifier, and binary tests are replaced by learned binary hash functions. Among recently introduced techniques, Binary Online Learned Descriptor (BOLD) [4] is independently optimised for each image patch, and Receptive Fields Descriptor (RFD) [9] thresholds the responses of rectangular or Gaussian pooling regions. In Optimised Binary Robust fAst Features (OBRAF) [28], up to 12 image patches with different scale-dependent sizes are divided into \(3\times 3\) pixel blocks, and then pairwise tests on intensities and directional gradients are performed. In that solution, the binary string is reduced using a simulated annealing algorithm; alternatively, in a simplified version of this descriptor [29], only four patches are used and the intensity tests are retained. The Local Difference Binary (LDB) [40] descriptor compares pixel blocks. It uses one image patch of fixed size, divided into 4, 9, 16, and 25 blocks. LDB and OBRAF were coupled with SURF keypoints. A further extension of LDB, Accelerated-KAZE (AKAZE) [3], introduced scale invariance by using the keypoint’s scale to calculate the size of the patch. In AKAZE, interest points are detected using Fast Explicit Diffusion [3].
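
The Hamming-distance comparison of binary strings described above (a bitwise XOR followed by a bit count) can be illustrated with a minimal Python sketch. The function name and the byte-string representation of descriptors are assumptions made for illustration only; they are not taken from any of the cited implementations.

```python
def hamming_distance(a: bytes, b: bytes) -> int:
    """Number of differing bits between two equal-length binary descriptors."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    # XOR leaves a 1 exactly where the two bit strings disagree.
    diff = int.from_bytes(a, "big") ^ int.from_bytes(b, "big")
    # Bit count (population count) of the XOR result.
    return bin(diff).count("1")


if __name__ == "__main__":
    print(hamming_distance(b"\xff\x00", b"\x0f\x0f"))
```

Because the comparison reduces to integer operations, its cost grows only with the length of the binary string, which is why shorter strings match faster.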

Well-performing binary descriptors often require dimensionality reduction [28, 39] or learning, which can be prone to overfitting; e.g. BinBoost showed outstanding performance in patch-based benchmarks, while obtaining mediocre results in typical image matching tests [4, 9]. Furthermore, their computation time is close to or longer than that of floating-point techniques, as can be seen for AKAZE, LDB, BinBoost, or OSRI.

Fig. 1 Patch partitioning strategy used in SBD; each patch contains five pixel blocks

In this paper, a novel binary descriptor is proposed which allows fast, robust, and scale- and rotation-invariant keypoint description by: (1) selecting scale-dependent patches around the keypoint, (2) calculating the keypoint’s dominant orientation, (3) using a small number of pixel blocks per patch, and (4) performing binary tests on directional gradients that belong to different patches. The first two properties are present in many known solutions. The use of scale-dependent patches seems to be an intuitive way of describing keypoints detected at different scales, which was confirmed in AKAZE and OBRAF. Interestingly, AKAZE and SURF share the size of the patch, which is equal to \(20\sigma \), where \(\sigma \) denotes the scale of the interest point. The dominant orientation is often estimated using sums of Haar wavelet responses (SURF), rotation of the integral image or the grid (LDB, AKAZE), or the intensity moments approach (ORB). The proposed descriptor uses five pixel blocks per patch (20 blocks in total), whereas OBRAF uses 99 blocks of pixels, BRAF 36, and AKAZE with LDB 54. The amount of information required to create the binary string is therefore significantly smaller for the proposed technique than for other block-based descriptors. Furthermore, in contrast to them, the proposed descriptor, named Simple Binary Descriptor (SBD), divides each patch into four disjoint blocks (\(2\times 2\)) and adds one centre block of the same size. Figure 1 presents the partitioning strategy introduced in SBD. In AKAZE, OBRAF, or LDB, all-against-all binary tests are performed between blocks that are placed on the same patch. SBD, in turn, performs binary tests on values obtained for blocks that belong to all selected patches. The values, i.e. gradients, are normalised with respect to the size of their blocks.

The creation pipeline of SBD can be described as follows. For each keypoint, \(n \in N\), detected on the image, four square image patches (\(P_i, i=1,\ldots ,4\)) are selected around it. The size of the i-th patch, \(A_i\), is determined by the scale of the interest point (\(\sigma \)), i.e. \(A_i= M_i \times M_i \), where \(M_i \in \{6\sigma , 12\sigma , 24\sigma , 48\sigma \}\). Then, the i-th patch is divided into four square pixel blocks, \(B_j^i, j=1, \ldots , 4\), and one additional block is placed in the centre (\(B_{j=5}^i\)). Blocks are characterised by directional gradients, \(D_x\) and \(D_y\). Intensity information, which is used in most binary descriptors, is omitted here in order to keep the binary string short. Directional gradients are obtained using integral images [5] and Haar-like box filters calculated for each block [3, 5]. The dominant orientation in SBD is calculated with half of the wavelet responses in the horizontal and vertical directions used by SURF [5]. The computation of the binary string can be written as:

$$\begin{aligned} {\text{ SBD }} = \sum _ {1\le o\le 190} 2^{o-1}T_{D_x} + \sum _ {1\le o\le 190} 2^{o-1}T_{D_y}, \end{aligned}$$
(1)

where o denotes the pair of compared blocks (\(B_j^i(o)\) and \(B_k^l(o), j\ne k \wedge i\ne l; j,k=1,\ldots ,5; i,l=1,\ldots ,4\)), and the test is defined as:

$$\begin{aligned} T_{D} = \left\{ \begin{array}{@{}ll@{}} 1, &{}\quad {\text{ if }}\ \frac{D(B_j^i(o))}{\text{ size } (B_j^i(o))}<\frac{D(B_k^l(o))}{\text{ size } \left( B_k^l(o)\right) } \\ 0, &{}\quad {\text{ otherwise }}. \end{array}\right. \end{aligned}$$
(2)
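
The pipeline and Eqs. (1) and (2) can be illustrated with a minimal Python sketch. It is not the authors' Java implementation and rests on several assumptions that are labelled in the comments: the dominant orientation step is skipped (an upright variant), block sides are rounded to an even number of pixels, the Haar-like response is taken as the difference between the two halves of a block, and all \(\binom{20}{2}=190\) unordered pairs of the 20 blocks are tested, which matches the 190 terms per gradient direction in Eq. (1) and gives \(2\times 190=380\) bits. All function names are hypothetical.

```python
from itertools import combinations

import numpy as np

PATCH_MULTIPLIERS = (6, 12, 24, 48)  # patch side M_i expressed in units of sigma


def integral_image(img):
    """Summed-area table with a leading row and column of zeros."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.float64)
    ii[1:, 1:] = np.asarray(img, dtype=np.float64).cumsum(axis=0).cumsum(axis=1)
    return ii


def box_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in columns x0..x1-1 and rows y0..y1-1, clipped to the image."""
    h, w = ii.shape[0] - 1, ii.shape[1] - 1
    x0, x1 = np.clip([x0, x1], 0, w)
    y0, y1 = np.clip([y0, y1], 0, h)
    return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]


def block_gradients(ii, x, y, half):
    """Assumed Haar-like responses of a block of side 2*half, divided by its area."""
    side = 2 * half
    dx = box_sum(ii, x + half, y, x + side, y + side) - box_sum(ii, x, y, x + half, y + side)
    dy = box_sum(ii, x, y + half, x + side, y + side) - box_sum(ii, x, y, x + side, y + half)
    area = float(side * side)  # normalisation by block size, as in Eq. (2)
    return dx / area, dy / area


def sbd_upright(img, kx, ky, sigma):
    """Upright sketch of the descriptor: 190 D_x bits followed by 190 D_y bits."""
    ii = integral_image(img)
    values = []
    for m in PATCH_MULTIPLIERS:
        half = max(1, int(round(m * sigma / 4.0)))  # half of a block side (assumption)
        block = 2 * half                            # block side; the patch side is 2 * block
        left, top = kx - block, ky - block
        origins = [
            (left, top), (left + block, top),                  # upper 2 x 2 blocks
            (left, top + block), (left + block, top + block),  # lower 2 x 2 blocks
            (kx - half, ky - half),                            # additional centre block
        ]
        values.extend(block_gradients(ii, ox, oy, half) for ox, oy in origins)
    pairs = list(combinations(range(len(values)), 2))  # 190 pairs of 20 blocks (assumption)
    bits_x = [int(values[a][0] < values[b][0]) for a, b in pairs]
    bits_y = [int(values[a][1] < values[b][1]) for a, b in pairs]
    return np.array(bits_x + bits_y, dtype=np.uint8)
```

For example, `sbd_upright(image, 100, 100, 1.0)` returns a 380-element array of zeros and ones for a keypoint at column 100 and row 100. The sketch normalises by the nominal block area even when a block is clipped at the image border, which a production implementation would need to handle more carefully.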
Fig. 2 Exemplary images from datasets: (a) Oxford [24] and Heinly et al. [12], (b) Phos [36], (c) the AH, (d) the DC, and (e) the BR [29]

3 Experiments

In this section, the influence of the parameters of SBD on its performance in matching tests is presented. Then, the proposed binary descriptor is compared with state-of-the-art binary and floating-point descriptors on popular image benchmarks. Finally, three demanding real-world datasets are used to assess the possible use of the compared descriptors in a vision-based localisation approach for partially sighted people.

3.1 Influence of the parameters

There are two main parameters used in the SBD creation pipeline: (1) the number of image patches and (2) the size of each image patch. In order to show how they influence the performance of SBD, matching tests on two popular image datasets were performed. In such a test, detected and described keypoints from two images are compared. Two keypoints are considered to be matched if the ratio of the distances to the first and the second closest keypoint is smaller than 0.8, taking into account a localisation error of three pixels and a 40% overlap [12, 24]. The area under the Recall versus 1-Precision curve was used as the performance index. Precision expresses the ratio of verified matches to returned matches, and Recall expresses the ratio of verified matches to all possible correct matches. In the matching tests, 500 keypoints per image were detected using SURF and described with SBD, and then threshold-based similarity matching was applied [5].
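
The distance-ratio criterion and the Precision and Recall definitions above can be sketched as follows. This is a simplified illustration, not the evaluation code used in the experiments: descriptors are assumed to be rows of 0/1 values, the second image is assumed to contain at least two keypoints, and the set of geometrically correct pairs (`correct_pairs`, hypothetical) is assumed to be given rather than derived from a homography with the pixel-error and overlap checks.

```python
import numpy as np


def ratio_test_matches(desc_a, desc_b, ratio=0.8):
    """Match rows of desc_a to rows of desc_b using the distance-ratio criterion."""
    desc_a = np.asarray(desc_a, dtype=np.uint8)
    desc_b = np.asarray(desc_b, dtype=np.uint8)
    matches = []
    for i, d in enumerate(desc_a):
        # Hamming distance from descriptor d to every descriptor in desc_b.
        dist = np.count_nonzero(desc_b != d, axis=1)
        order = np.argsort(dist, kind="stable")
        best, second = order[0], order[1]
        # Accept only if the closest candidate is clearly better than the runner-up.
        if dist[best] < ratio * dist[second]:
            matches.append((i, int(best)))
    return matches


def precision_recall(matches, correct_pairs):
    """Precision = verified / returned matches; Recall = verified / possible correct matches."""
    correct = set(correct_pairs)
    verified = sum(1 for m in matches if m in correct)
    precision = verified / len(matches) if matches else 0.0  # undefined without matches; 0.0 by convention here
    recall = verified / len(correct) if correct else 0.0
    return precision, recall
```

Sweeping the acceptance threshold and recording the resulting (1-Precision, Recall) points would give the curve whose area serves as the performance index.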

The Oxford [24] and Heinly et al. [12] datasets were used in the experiments. These popular benchmarks contain base images, as well as sequences of transformed images with known homographies between them. The datasets contain images that exhibit a large amount of scaling, rotation, viewpoint change, blur, illumination change, exposure change, or compression. Figure 2a contains some images from these datasets. For each image pair, the area under the Recall versus 1-Precision curve was calculated, and then the mean value over all sequences from both datasets was taken as the measure of performance of the given set of SBD parameters. There are many possible combinations of the number of image patches centred on the keypoint and the relation of their sizes to the keypoint’s scale. In the experiments, the number of patches was in the range [1, 4], and their sizes, expressed as the length of the patch’s side in multiples of the keypoint’s scale, were in the range [5, 50]. The size of one patch was varied, while the other patches were either not used or had fixed sizes, each twice the size of its predecessor starting from 5; e.g. in the case of three patches, M is equal to 5 and 10 for the first and the second patch, respectively. Since this paper introduces a new concept of binary tests between values that belong to different patches, the experiment was divided into two parts. In the first part, binary tests were performed only between pixel blocks that belong to the same patch, as in a typical block-based descriptor; in the second part, binary tests covered all blocks. The obtained results are presented in Fig. 3. It can be seen that the proposed approach provided stable results regardless of the growing size of the examined patch. Increasing the number of patches had a positive influence on the performance of the resulting descriptor. Interestingly, the use of binary tests performed on blocks from all patches led to considerably better performance of the descriptor, as shown in Fig. 3b. This gain in performance was achieved by comparing areas that contain different amounts of information. Furthermore, the results for more than two patches were better than the results for the compared state-of-the-art binary descriptors (see Sect. 3.2).

Fig. 3 Influence of the number of image patches and their sizes on the mean area under Recall versus 1-Precision curves calculated for image sequences from the Oxford [24] and Heinly et al. [12] datasets. The experiments were divided into two parts, in which binary tests were performed between pixel blocks that belong to (a) the same image patch and (b) different patches. The second approach to the arrangement of binary tests is introduced in this paper

3.2 Comparative evaluation

Image matching benchmarks were also used to provide a comparative evaluation of SBD against state-of-the-art binary descriptors. SBD is implemented in Java, and thus all available binary descriptors from the BoofCV (http://boofcv.org/) [1] and javaCV (https://github.com/bytedeco/javacv) libraries were used, i.e. BRIEF, BRISK, ORB, and AKAZE; javaCV is a Java wrapper for the widely used OpenCV library (C++). Furthermore, two floating-point descriptors, SURF and SIFT, were also used in the tests, in order to show that SBD can outperform them while running in a fraction of their description time. Binary descriptors were compared using the Hamming distance, while their floating-point counterparts were compared using the Euclidean distance. SBD and BRIEF described SURF keypoints. SBD was also run with FAST keypoints, which were used by ORB, since fast interest point detection can be desirable in some applications.

Table 1 Comparison of the approach with state-of-the-art binary and floating-point descriptors in matching tests on Oxford, Heinly et al., and Phos datasets, in terms of the mean area under Recall versus 1-Precision curves

Three datasets were used in these experiments. The Oxford and Heinly et al. datasets contain mostly rotated and scaled images; therefore, in order to provide a more thorough evaluation of the descriptors under various illumination conditions, the Phos dataset [36] was also used. Phos contains 15 scenes captured under varying strengths of uniform illumination and varying degrees of non-uniform illumination. There are underexposed \((-)\) and overexposed (+) images in this dataset, and a strong directional light source was used for capturing the non-uniformly illuminated images.

Table 2 Comparison of the approach with state-of-the-art binary and floating-point descriptors in object recognition tests on the AH, the DC, and the BR datasets, in terms of the number of correctly recognised objects or places

The mean area under Recall versus 1-Precision curves was used to compare the descriptors. The obtained results are presented in Table 1. They reveal that the introduced binary descriptor, SBD, outperformed the compared binary counterparts by a large margin: the overall means reported for the Oxford and Heinly et al. datasets for SBD with SURF and FAST keypoints were 1.5 and 1.36 times better, respectively, than the result obtained by the best of the other binary descriptors (AKAZE). The mean values for each dataset lead to a similar observation. Among the heavy floating-point solutions, SURF obtained the best overall mean, but was worse than SBD on the Heinly et al. dataset. SIFT was better than the other descriptors on this dataset; however, in general, it was outperformed by SBD with SURF keypoints. This version of SBD was worse only than BRIEF for the image sequence with exposure changes (Leuven) and worse only than AKAZE for two image sequences that contain rotated images (Bikes and Ceiling). For the other image sequences, SBD with SURF keypoints clearly outperformed the other descriptors, and for the Boat, Graffiti, Wall, and Day and Night sequences it was better than the floating-point techniques. For Phos, BRIEF showed good performance, since the test images are not rotated, and the applied binary tests, which are also present in other descriptors, were able to compensate for illumination changes. SBD using FAST and SURF keypoints outperformed all other compared descriptors on this dataset. Comparing these two interest point detectors, it seems that SURF keypoints are less stable against illumination changes.

The implementation of SBD was run single-threaded on an Intel Core i5-5200U 2.2 GHz CPU with 8 GB RAM, Java 8, and Microsoft Windows 7. The description times, measured per keypoint for the first image from the Bikes sequence, were as follows: SURF 0.1403 ms, SIFT 0.7517 ms, BRIEF 0.0276 ms, ORB 0.0225 ms, AKAZE 0.1406 ms, BRISK 0.0425 ms, and SBD 0.029 ms. SBD was slightly slower than ORB and BRIEF, but it considerably outperformed them in the tests. It was faster than BRISK and almost five times faster than AKAZE or SURF. SIFT was the slowest of the compared descriptors. Matching time depends on the length of the binary string and the number of detected keypoints, which was constant for all techniques. Here, only ORB with its 256-bit string was faster than SBD. They were followed by AKAZE (486 bits), BRISK, and BRIEF (512 bits), and by the floating-point descriptors, for which the matching time limits the number of possible applications. The upright version of SBD, in which the dominant orientation of the keypoint is not used, was computed in 0.007 ms, which is almost four times faster than BRIEF, which likewise does not include this step.
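
The speed-up factors quoted above follow directly from the reported per-keypoint description times; the subscripted symbols \(t\) are introduced here only for this worked check:

$$\begin{aligned} \frac{t_{\text{SURF}}}{t_{\text{SBD}}} = \frac{0.1403}{0.029} \approx 4.84, \qquad \frac{t_{\text{AKAZE}}}{t_{\text{SBD}}} = \frac{0.1406}{0.029} \approx 4.85, \qquad \frac{t_{\text{BRIEF}}}{t_{\text{SBD, upright}}} = \frac{0.0276}{0.007} \approx 3.94. \end{aligned}$$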

3.3 Application to vision-based localisation

In order to evaluate the usability of the developed binary descriptor in a vision-based assistive technology for supporting partially sighted people, specific image datasets are required. They should contain images of building interiors, as well as images of outdoor objects or scenes. It can be assumed that images of such places or objects are labelled and that, upon recognition, their labels can be pronounced using text-to-speech technology. Therefore, three labelled image datasets were used in this paper. Two of them, the At Home (AH) and the Doors and Corridors (DC) datasets, were created for the needs of this study. They can be downloaded at http://www.marosz.kia.prz.edu.pl/datasets.html. The AH dataset contains 250 images taken in an apartment. There are 190 learning images and 60 test images; most of them are rotated (\(90^{\circ }\)). Here, labels refer to the part of the apartment and the observed objects. The DC dataset, in turn, contains labelled images of corridors and doors captured at the Department of Computer and Control Engineering at the Rzeszow University of Technology, Poland. There are 111 learning examples and 126 test images in this dataset. The third image collection, the Beautiful Rzeszow (BR) dataset [29], is much larger and contains 3000 images depicting 50 tourist attractions in Rzeszow, Poland. They were photographed at different times of the day (day and night) and in different seasons (spring, autumn, and winter). The dataset is particularly challenging, since it covers many image transformations, such as changes in scale, viewpoint, and rotation. There are also difficult illumination changes and occlusions. Images captured at a different time of the day were used for testing. Exemplary images from these three datasets are shown in Fig. 2c–e.

The SBD descriptor was compared with other state-of-the-art binary and floating-point techniques. The test images were recognised using a k-nearest neighbour classifier (\(k=1\)) operating on the number of returned matched descriptor pairs. Since the recognised images indicate the location of a person, as well as the objects seen, such an image recognition approach can be used for supporting partially sighted people.
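
The recognition step described above, a nearest-neighbour decision (\(k=1\)) on the number of matched descriptor pairs, can be sketched as follows. The function names and the toy matching function are hypothetical; in practice, `count_matches` would wrap a descriptor matcher such as the distance-ratio test used in the matching experiments.

```python
def recognise(test_descriptors, training_images, count_matches):
    """Return the label of the training image with the most matches to the test image.

    training_images: iterable of (label, descriptors) pairs.
    count_matches:   callable returning the number of matched descriptor pairs.
    """
    best_label, best_count = None, -1
    for label, descriptors in training_images:
        count = count_matches(test_descriptors, descriptors)
        if count > best_count:  # ties keep the first training image encountered
            best_label, best_count = label, count
    return best_label


def shared_items(a, b):
    """Toy stand-in for a descriptor matcher: counts identical items."""
    return len(set(a) & set(b))


if __name__ == "__main__":
    training = [("kitchen", [1, 2, 3]), ("hall", [3, 4, 5, 6])]
    print(recognise([4, 5, 6], training, shared_items))
```

The returned label could then be passed to a text-to-speech component, as assumed in the description of the datasets.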

The recognition results obtained on the three datasets are presented in Table 2. The number of keypoints used per image varied from 20 to 500 for the first two datasets and was set to 500 for the BR dataset. For the AH dataset, the version of SBD using SURF keypoints was better than the other descriptors. However, AKAZE achieved good performance with a small number of detected keypoints, close to the results obtained by SBD with FAST interest points. Since only some of the images in this dataset are rotated, and the scale change is small, BRIEF’s performance is worth noting. Floating-point descriptors were better than SBD (FAST), AKAZE, and BRISK. The second dataset, the DC, turned out to be more difficult, since the door images are very similar. Furthermore, it can be seen that matching-based recognition has difficulties with repetitive patterns. For this dataset, SBD on SURF keypoints, ORB, and SIFT outperformed the other descriptors. The recognition results obtained for the BR dataset show the outstanding performance of SBD with SURF keypoints. Here, SBD recognised a similar number of images to SIFT, in a fraction of its description and matching time. Also, SBD with FAST keypoints performed as well as AKAZE, and better than the other binary descriptors. Owing to the high robustness of the presented binary descriptor to illumination conditions, its recognition results for images taken at night are much better than those of the other techniques. In general, SBD using SURF keypoints achieved the best recognition accuracy.

4 Conclusion

In this paper, an approach to image recognition with binary features for localisation purposes, aimed at supporting visually impaired people, was considered. Since neither the matching-based image recognition performance of widely used binary descriptors nor the computation and matching time of their heavy floating-point counterparts is satisfactory, a new binary descriptor was introduced. SBD is fast to compute and is more robust to different image transformations than the compared techniques. Its creation pipeline selects four scale-dependent image patches centred on a keypoint, covers each of them with five pixel blocks, and then performs binary tests on directional gradients calculated for the blocks. In contrast to other block-based descriptors, the binary tests are also performed between values determined for blocks from different patches. The descriptor was evaluated and compared with state-of-the-art descriptors using three popular image benchmarks, as well as three real-world image collections with labelled images. The obtained results are promising; they confirm the usability of SBD for vision-based recognition and localisation.