A brain-inspired SLAM system based on ORB features

This paper describes a brain-inspired simultaneous localization and mapping (SLAM) system for a mobile robot that uses oriented features from accelerated segment test and rotated binary robust independent elementary features (ORB) extracted from an RGB (red, green, blue) sensor. The core SLAM system, dubbed RatSLAM, constructs a cognitive map from raw odometry and the visual scenes along the path traveled. Unlike the existing RatSLAM system, which uses only a simple vector to represent the features of a visual image, we employ an efficient and very fast descriptor method, ORB, to extract features from RGB images. Experiments show that these features are suitable for recognizing sequences of familiar visual scenes. Thus, when loop closure errors are detected, the descriptive features help to correct the pose estimate by driving loop closure and localization in a map correction algorithm. The efficiency and robustness of our method are also demonstrated by comparison with different visual processing algorithms.


Introduction
Animals have an instinctive ability to explore and navigate in an unknown space. Inspired by the spatial cognition of animals, many researchers over the past decades have investigated how animals perceive, store and maintain spatial knowledge [1−7]. In 1948, Tolman [1] first proposed that a cognitive map may exist in the rodent's brain. In 1971, O'Keefe and Dostrovsky [2] found that place cells, located in the hippocampus, increase their firing rate when the rodent is in a particular position in a maze. Later neurophysiological studies [3−6] revealed that the entorhinal cortex (EC) holds an internal representation of the environment in spatially encoded cells identified as "grid cells". These cells are activated in a hexagonal pattern during spatial navigation and are independent of the external environment. Furthermore, experimental results [7] showed that the hippocampus can receive all sensory neocortical information from the EC through the connections between the two. These biological findings motivated robotics researchers to build cognitive maps by integrating the activity of both grid cells and place cells in the EC-hippocampal area [8−11].
These findings not only give us a deeper understanding of the brain "GPS" (global positioning system) [12], but have also triggered research on brain-inspired simultaneous localization and mapping (SLAM) and related algorithms [13−16]. Although much about the mechanism of the brain "GPS" remains unknown, it can be safely claimed that the SLAM system of rodents is based not only on internal path integration but also on external visual cues from the physical world. A rodent can update its pose by fusing external visual information with the neural activity of place cells and grid cells. These behaviors are similar to the prediction and update steps in SLAM. Unlike existing SLAM algorithms based on probabilistic methods [17,18], RatSLAM [19], a brain-inspired SLAM algorithm, creates a cognitive map from visual information and its previous experiences. RatSLAM simulates the rodent's spatial cognition mechanism using pose cells, local view cells and a cognitive map. Pose cells form a continuous attractor network (CAN), which is suggested to model path integration by grid cells in the EC; local view cells functionally replace the rodent's perceptual system; and the cognitive map is an analogy of place cells in the hippocampus. In practice, Milford and Wyeth [19] successfully built a coherent map of a 66 km suburb using only a single web camera. However, the vision algorithm of this method is not suited to an office environment with many similar scenes. To map an office environment, Tian et al. [20,21] took advantage of RGB-D (red, green, blue and depth) information to construct a spatial cognitive map. Nevertheless, both of the above methods simply extract intensity scanline profiles from the visual information and compare these profiles with one another.
In recent years, various computer vision techniques have been proposed to detect and describe local features, and they are widely applied in visual SLAM algorithms [22−24]. Scale-invariant feature transform (SIFT) [25] is a popular and useful visual processing algorithm that is highly invariant to image scaling and rotation, and partially invariant to illumination. For over a decade, SIFT has been successfully applied in a range of applications. However, SIFT is not suitable for real-time use, because computing the scale factor is very time-consuming. Therefore, many methods aimed at speeding up the computation of SIFT have been published [26−32].
In particular, features from accelerated segment test (FAST) [28] and binary robust independent elementary features (BRIEF) [30] are arguably two outstanding methods. FAST uses machine learning to derive a corner detector and provides a suitable way of finding key points in real-time systems such as parallel tracking and mapping (PTAM) [33]. BRIEF is a binary feature descriptor that relies on a relatively small number of intensity difference tests to represent an image patch. One great advantage of BRIEF descriptors is that they are very fast to compute and to compare. Gálvez-López and Tardos [34] proposed a real-time visual place recognition method using both FAST and BRIEF. Furthermore, by integrating FAST and BRIEF, Rublee et al. [35] developed a visual processing method named oriented FAST and rotated BRIEF (ORB), which is an efficient alternative to SIFT. ORB uses the FAST algorithm to detect key points in images, filters the key points using the Harris corner measure [36] and computes binary descriptors using oriented BRIEF. Because ORB provides good performance at low computational cost, it is suitable for constructing a semi-dense map in real time [37].
Inspired by these implementations of ORB and RatSLAM, we adopt ORB to replace the vision processing method of the RatSLAM algorithm. By fusing RGB information with raw odometry information, a cognitive map can be built. This map contains the set of spatial coordinates that the robot has traveled. To verify the effectiveness of our method, we also compare it with other visual processing methods. The experimental results show that ORB can significantly enhance the robustness of the proposed SLAM system.
This paper is organized as follows. Section 1 has introduced the background of brain-inspired SLAM and discussed the purpose of this paper. Section 2 gives an overview of the system architecture and robot platform. Section 3 explains the visual processing method in detail. Section 4 describes the cognitive model of brain-inspired SLAM. Section 5 presents experimental results that show the effectiveness of the proposed method. Finally, Section 6 concludes the paper.

System overview
Fig. 1 shows the hardware architecture of the robot used in the experiment. The robot consists of a mobile base, an RGB-D camera and a miniPC. We use a Pioneer 3-DX as the mobile base. It is a compact differential-drive mobile robot that includes a motion controller with built-in encoders. The robot's embedded motion controller performs velocity control and provides odometry information. The RGB-D camera is mounted on the mobile base to capture visual information about the environment. The miniPC processes the visual information, the odometry information and the cognitive model.
A schematic overview of the cognitive map building system is shown in Fig. 2. The system consists mainly of a visual processing module and a cognitive map model. The visual processing module is responsible for extracting ORB features. In the cognitive map model, the ORB features are associated with local view cells, which represent the cues present at a specific location. Raw odometry information from the mobile base drives the path integration process. The internal CAN dynamics ensure that the cognitive map model converges to a steady state even without external sensors. Together, local view cell activity, path integration and CAN dynamics build the cognitive map.

Visual processing
Visual processing is an essential part of our SLAM algorithm. During mapping, visual information is used to decide when to generate local view cells and to eliminate redundant images; during global localization, the robot captures incoming frames and compares them with recorded visual templates to determine its location. In the previous implementation, Milford and Wyeth [19] built a coherent map of large suburban areas. Milford's vision algorithm uses a scanline intensity profile to deduce the translation and rotation relative to previously seen visual information. The scanline intensity profile is a one-dimensional vector obtained by summing the pixels in each column and then normalizing the vector. However, this approach is imprecise and not a good fit for an office environment with many similar scenes. In this paper, we use the ORB algorithm to replace this simple vision algorithm. The method performs well in the experiments reported in Section 5.
ORB combines the FAST detector and an oriented BRIEF descriptor, with modifications that enhance performance. As a classical visual processing algorithm, FAST is very useful for detecting key points. However, FAST does not produce multi-scale features and has no orientation component. To address the former limitation, a scale pyramid of the image is employed and FAST features are generated at each level of the pyramid. To address the latter, the intensity centroid is introduced to measure corner orientation. As for the descriptor, BRIEF is one of the fastest available feature descriptors to compute and match, but it is not invariant to in-plane rotation. To address this limitation, Rublee et al. [35] adopt a steered BRIEF. The BRIEF descriptors in ORB should have two properties: low correlation and high variance among the binary tests. Low correlation means that each new test brings new information into the descriptor, thus maximizing the amount of information it records. High variance makes the features of an image more discriminative. Rublee et al. developed a learning method to ensure that descriptors have these two properties. This method searches for a set of uncorrelated binary tests in an image patch, as shown in Fig. 3.
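The intensity centroid mentioned above can be illustrated with a minimal sketch. It assumes the usual definition, in which the orientation is the angle of the vector from the patch centre to the intensity-weighted centroid, computed from the first-order moments as atan2(m01, m10). The function name `patch_orientation` and the tiny test patches are illustrative only and are not taken from the paper.

```python
import math

def patch_orientation(patch):
    """Orientation (radians) of a grayscale patch via the intensity centroid.

    patch is a list of rows of pixel intensities; coordinates are measured
    from the patch centre, with x to the right and y downwards.
    """
    h, w = len(patch), len(patch[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    m10 = m01 = 0.0
    for r, row in enumerate(patch):
        for c, val in enumerate(row):
            m10 += (c - cx) * val  # first-order moment along x
            m01 += (r - cy) * val  # first-order moment along y
    return math.atan2(m01, m10)

# A patch that is brighter on the right: the centroid lies on the +x axis.
bright_right = [[0, 0, 9],
                [0, 0, 9],
                [0, 0, 9]]
# A patch that is brighter at the bottom: the centroid lies on the +y axis.
bright_bottom = [[0, 0, 0],
                 [0, 0, 0],
                 [9, 9, 9]]
angle_right = patch_orientation(bright_right)
angle_bottom = patch_orientation(bright_bottom)
```

Rotating the BRIEF sampling pattern by this angle is what makes the descriptor "steered"; that step is omitted here.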
Fig. 3 presents a flow chart of the learning method used in ORB to choose a good set of binary tests with high variance and low correlation. In this method, all possible binary tests in an image patch are first ordered so that high-variance tests come first. A greedy search is then performed to filter out highly correlated binary tests. Finally, the selected binary tests form the ORB descriptor.

Fig. 3 Flow chart of learning method to choose a good set of binary features with high variance and low correlation
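The ordering and greedy filtering steps of Fig. 3 can be sketched as follows. This is a simplified illustration, not the authors' or Rublee's implementation: each candidate test is represented by its 0/1 responses over a set of training patches, tests are ordered by how close their mean is to 0.5 (which corresponds to high variance for a binary variable), and a test is kept only if its absolute correlation with every test already chosen is below a threshold. The names `select_tests`, `max_corr` and the toy responses are assumptions.

```python
def correlation(u, v):
    """Pearson correlation of two equal-length 0/1 response sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sum((a - mu) ** 2 for a in u) ** 0.5
    sv = sum((b - mv) ** 2 for b in v) ** 0.5
    if su == 0 or sv == 0:
        return 1.0  # a constant test carries no information; treat as redundant
    return cov / (su * sv)

def select_tests(responses, n_select, max_corr):
    """Greedily pick up to n_select high-variance, mutually uncorrelated tests."""
    # Order tests by distance of their mean from 0.5 (closest first).
    order = sorted(
        range(len(responses)),
        key=lambda i: abs(sum(responses[i]) / len(responses[i]) - 0.5),
    )
    chosen = []
    for i in order:
        if all(abs(correlation(responses[i], responses[j])) < max_corr
               for j in chosen):
            chosen.append(i)
        if len(chosen) == n_select:
            break
    return chosen

# Toy responses of four candidate tests over eight training patches.
responses = [
    [1, 0, 1, 0, 1, 0, 1, 0],  # mean 0.5
    [1, 0, 1, 0, 1, 0, 1, 0],  # duplicate of test 0, fully correlated
    [1, 1, 0, 0, 1, 1, 0, 0],  # mean 0.5, uncorrelated with test 0
    [1, 1, 1, 1, 1, 1, 1, 0],  # low variance
]
selected = select_tests(responses, n_select=2, max_corr=0.2)
```

In this toy case the duplicate test is rejected and tests 0 and 2 are kept. A full implementation would also relax the threshold when too few tests survive, which is left out here.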
ORB thus integrates the FAST and BRIEF approaches. Experimental results showed that, for the same number of features, ORB is an order of magnitude faster than speeded up robust features (SURF) and over two orders of magnitude faster than SIFT. ORB has also been used to achieve good performance in place recognition [37,38]. Compared with other feature descriptions such as SIFT, SURF and FAST, the major advantages of ORB are: 1) It provides fast computation and matching. 2) It retains invariance to image scaling and rotation. 3) It is relatively immune to image noise, lighting changes and perspective distortion.
Thus, the ORB method is employed in this paper to extract features from images as visual templates. When the current image matches a prior visual template, the robot is considered to have visited this place before. Otherwise, a new visual template is added to the local view cells, as shown in Fig. 4.
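The match-or-add decision described above can be sketched as follows. The paper does not state its matching criterion, so everything specific here is an assumption: binary descriptors are modelled as Python integers compared by Hamming distance, a template matches when a minimum fraction of its descriptors have a near neighbour, and the names `match_or_add`, `max_dist` and `min_fraction` as well as the 8-bit toy descriptors are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")

def match_fraction(desc_a, desc_b, max_dist):
    """Fraction of descriptors in desc_a with a neighbour in desc_b within max_dist bits."""
    if not desc_a or not desc_b:
        return 0.0
    good = sum(1 for d in desc_a
               if min(hamming(d, e) for e in desc_b) <= max_dist)
    return good / len(desc_a)

def match_or_add(templates, current, max_dist=2, min_fraction=0.6):
    """Return (template index, is_new) for the current frame's descriptors."""
    best_idx, best_score = -1, 0.0
    for idx, tpl in enumerate(templates):
        score = match_fraction(current, tpl, max_dist)
        if score > best_score:
            best_idx, best_score = idx, score
    if best_score >= min_fraction:
        return best_idx, False          # familiar scene: existing local view cell fires
    templates.append(current)           # unfamiliar scene: learn a new template
    return len(templates) - 1, True

templates = []
scene_a = [0b11110000, 0b10101010, 0b00001111]
scene_a_noisy = [0b11110001, 0b10101010, 0b00001111]  # one flipped bit
scene_b = [0b01010101, 0b00110011, 0b11001100]

first = match_or_add(templates, scene_a)          # learned as template 0
revisit = match_or_add(templates, scene_a_noisy)  # matches template 0
other = match_or_add(templates, scene_b)          # learned as template 1
```

Real ORB descriptors are 256 bits long, and the thresholds would need to be tuned on data.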

Cognitive map model
The cognitive map model contains four main components: local view cells, CAN dynamics, the path integration process and the cognitive map construction process. This section describes each of these components in detail.

Local view cells
Local view cells are stored as a vector in which each element is associated with a cell unit in the CAN. In the visual processing module, if the current visual template is sufficiently similar to a recorded view template, the cognitive map model injects energy into the associated cell unit. In contrast, significantly different images are treated as new templates and appended to the vector. The injected energy produces an activity packet in an inactive zone of the CAN model, which helps the system perform path integration and correct loop closures. The change in the cognitive map model caused by the local view is described as
$$\Delta P_{x,y,\theta} = \sum_{i} \beta_{i,x,y,\theta}\, V_i \tag{1}$$
where β is the connection matrix from local view cells Vi to the associated CAN cells P.

CAN dynamics
The core of the cognitive map model is a continuous attractor neural network, known as the pose cell model. This model is widely used to simulate the brain "GPS". In RatSLAM, pose cells are arranged in (x, y, θ) coordinates, where (x, y) corresponds to a location in the pose cell planes and θ represents the robot's head direction. The dynamics of the CAN structure keep the activity P in the pose cells steady through the following phases: an excitatory update, an inhibition and a normalization.

Excitatory update
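In this phase, a three-dimensional Gaussian distribution is used to create an excitatory weight matrix ε, which spreads the activity P of each cell to all other cells in the pose cell matrix. The distribution is defined by
$$\varepsilon_{a,b,c} = \exp\!\left(-\frac{a^{2}+b^{2}}{\sigma_{xy}}\right)\exp\!\left(-\frac{c^{2}}{\sigma_{\theta}}\right) \tag{2}$$
where σxy and σθ are the variances of the distribution in the (x, y) plane and the θ dimension, and the indexes a, b and c are the x, y and θ offsets within the distribution in pose cell coordinates. The change of activity ΔPx,y,θ in the pose cells is then given by
$$\Delta P_{x,y,\theta} = \sum_{i=1}^{\mathrm{Dim}X}\sum_{j=1}^{\mathrm{Dim}Y}\sum_{k=1}^{\mathrm{Dim}\theta} P_{i,j,k}\,\varepsilon_{a,b,c} \tag{3}$$
where a, b and c are the offsets between cells (x, y, θ) and (i, j, k), and Dim X, Dim Y and Dim θ are the three dimensions of the pose cells. We set Dim X = Dim Y = 60 and Dim θ = 36 in our experiment.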

Inhibition
Each cell also inhibits neighboring cells via an inhibitory weight matrix ξ, which has the same form as the excitatory weight matrix but with non-positive weights. After the excitatory phase, this opposing inhibition drives the network toward convergence. With local inhibition and a global inhibition μ, the change of activity is given by
$$\Delta P_{x,y,\theta} = \sum_{i=1}^{\mathrm{Dim}X}\sum_{j=1}^{\mathrm{Dim}Y}\sum_{k=1}^{\mathrm{Dim}\theta} P_{i,j,k}\,\xi_{a,b,c} - \mu \tag{4}$$
where ξ is the inhibitory weight matrix and the activity P in the pose cell model is limited to non-negative values.

Normalization
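Finally, normalization ensures that the total activity P in the pose cells remains at one. The activity after normalization is
$$P^{t+1}_{x,y,\theta} = \frac{P^{t}_{x,y,\theta}}{\sum_{i=1}^{\mathrm{Dim}X}\sum_{j=1}^{\mathrm{Dim}Y}\sum_{k=1}^{\mathrm{Dim}\theta} P^{t}_{i,j,k}} \tag{5}$$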
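The three phases (excitatory update, inhibition, normalization) can be sketched on a small wrap-around grid as follows. This is a toy illustration under stated assumptions rather than the RatSLAM implementation: the grid is 5 x 5 x 4 instead of 60 x 60 x 36, the Gaussian kernels are scaled to sum to one so that the numbers stay bounded (a choice of this sketch, not stated in the text), the inhibitory kernel is stored with positive values and subtracted, which is equivalent to adding non-positive weights, and all names and parameter values are hypothetical.

```python
import math
from itertools import product

def ring_offset(d, n):
    """Signed shortest offset on a ring of n cells (the network wraps around)."""
    d %= n
    return d - n if d > n // 2 else d

def gaussian_kernel(shape, var_xy, var_th):
    """exp(-(a^2+b^2)/var_xy) * exp(-c^2/var_th), scaled to sum to one."""
    nx, ny, nt = shape
    k = {}
    for a, b, c in product(range(nx), range(ny), range(nt)):
        da, db, dc = ring_offset(a, nx), ring_offset(b, ny), ring_offset(c, nt)
        k[(a, b, c)] = (math.exp(-(da * da + db * db) / var_xy)
                        * math.exp(-(dc * dc) / var_th))
    total = sum(k.values())
    return {key: v / total for key, v in k.items()}

def spread(P, kernel, shape):
    """For every cell, sum source activity times the weight at the wrapped offset."""
    nx, ny, nt = shape
    out = dict.fromkeys(P, 0.0)
    for (i, j, k), p in P.items():
        if p == 0.0:
            continue
        for (x, y, t) in P:
            out[(x, y, t)] += p * kernel[((x - i) % nx, (y - j) % ny, (t - k) % nt)]
    return out

def can_step(P, shape, exc, inh, mu):
    """One excitation / inhibition / normalization cycle of the pose cells."""
    d = spread(P, exc, shape)                       # excitatory update
    P = {c: P[c] + d[c] for c in P}
    d = spread(P, inh, shape)                       # local inhibition
    P = {c: max(P[c] - d[c] - mu, 0.0) for c in P}  # global inhibition, clip at 0
    total = sum(P.values())
    if total == 0.0:
        return P                                    # nothing left to normalize
    return {c: v / total for c, v in P.items()}     # total activity becomes one

shape = (5, 5, 4)
P = {c: 0.0 for c in product(range(5), range(5), range(4))}
P[(2, 2, 1)] = 1.0                                  # a single active pose cell
exc = gaussian_kernel(shape, var_xy=1.0, var_th=1.0)
inh = gaussian_kernel(shape, var_xy=2.0, var_th=2.0)
P = can_step(P, shape, exc, inh, mu=0.01)
```

After one step the packet is still centred on the seeded cell and the activity sums to one.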

Path integration
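Path integration shifts the pose cell activity from its current position according to raw odometry information. This process eliminates the need to tune parameters in the pose cell model. Furthermore, the path integration process is independent of variations in robot velocity and of sensory information. The activity update process is defined as
$$P^{t+1}_{x_i,y_i,\theta_i} = \sum_{a=\rho_x}^{\rho_x+1}\sum_{b=\rho_y}^{\rho_y+1}\sum_{c=\rho_\theta}^{\rho_\theta+1} \alpha_{abc}\,P^{t}_{(x_i+a),(y_i+b),(\theta_i+c)} \tag{6}$$
where ρx, ρy and ρθ are integer offsets relative to the current activity position (xi, yi, θi). The activity at time step t+1 depends on the activity at the previous time step and on a residue component α, which is based on the fractional part of the offset. The integer offsets are calculated by
$$\begin{bmatrix}\rho_x\\ \rho_y\\ \rho_\theta\end{bmatrix} = \begin{bmatrix}\lfloor \lambda_x\,\upsilon\cos\theta_i \rfloor\\ \lfloor \lambda_y\,\upsilon\sin\theta_i \rfloor\\ \lfloor \lambda_\theta\,\omega \rfloor\end{bmatrix} \tag{7}$$
where υ is the translational velocity, ω is the rotational velocity and θi is the preferred cell orientation. λx, λy and λθ are path integration constants.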
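As a worked illustration of the offsets, the sketch below splits the scaled motion into an integer part and a fractional part. It assumes that the integer offset is the floor of the scaled velocity component and that the residue weights are the trilinear weights built from the fractional parts, as in the original RatSLAM formulation [19]; the function names and the numeric constants are hypothetical.

```python
import math

def integer_offsets(v, omega, theta, lam_x, lam_y, lam_th):
    """Integer cell offsets and fractional remainders for one time step."""
    raw = (lam_x * v * math.cos(theta),
           lam_y * v * math.sin(theta),
           lam_th * omega)
    rho = tuple(math.floor(r) for r in raw)
    frac = tuple(r - f for r, f in zip(raw, rho))
    return rho, frac

def residue_weights(frac):
    """Weights over the 2 x 2 x 2 block of cells that the shifted packet overlaps."""
    def g(u, step):
        # Share of activity assigned to the cell at integer offset + step.
        return 1.0 - u if step == 0 else u
    return {(a, b, c): g(frac[0], a) * g(frac[1], b) * g(frac[2], c)
            for a in (0, 1) for b in (0, 1) for c in (0, 1)}

# Robot heading along +x at 1.0 unit/s while turning at 0.5 rad/s.
rho, frac = integer_offsets(v=1.0, omega=0.5, theta=0.0,
                            lam_x=2.5, lam_y=2.5, lam_th=1.0)
weights = residue_weights(frac)
```

Here the scaled motion is (2.5, 0.0, 0.5) cells, so the packet moves two whole cells in x and the remaining half cell in x and θ is distributed by the residue weights, which sum to one.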

Cognitive map construction
The cognitive map is a topological map that represents the spatial relationships among landmarks in an environment. Each point in the cognitive map can be represented by a triple
$$e_i = \{P_i, V_i, p_i\} \tag{8}$$
where ei, called an experience, is a point in the cognitive map, Pi and Vi are the pose cells and local view cells associated with ei, and pi is the location of ei within the map space. In this paper, if the distance between the current pose cell and the pose cells of the existing experiences reaches a threshold, a new experience is created:
$$e_j = \{P_j, V_j, p_i + \Delta p_{ij}\} \tag{9}$$
where the translation Δpij between the two experiences is obtained from the mobile base. Finally, when the cognitive map model detects a loop closure at experience ei, the change in the position of experience ei is obtained by
$$\Delta p_i = \alpha\left[\sum_{j=1}^{N_f}\left(p_j - p_i - \Delta p_{ij}\right) + \sum_{k=1}^{N_t}\left(p_k - p_i - \Delta p_{ki}\right)\right] \tag{10}$$
where α = 0.5 is a correction rate constant, Nf is the number of links from experience ei to other experiences, and Nt is the number of links from other experiences to experience ei.
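The map correction step can be sketched as follows. It assumes the standard RatSLAM form of the update, in which every link contributes the term (neighbour position − own position − stored offset) and the sum is scaled by the correction rate α = 0.5; positions are simplified to 2D tuples, and the function name and the example numbers are hypothetical.

```python
def correct_position(p_i, alpha, links_from, links_to):
    """Return the corrected position of experience i after one relaxation step.

    links_from: list of (p_j, dp_ij) for links from experience i to j
    links_to:   list of (p_k, dp_ki) for links from experience k to i
    Each link contributes (neighbour - p_i - stored offset) to the update.
    """
    dx = dy = 0.0
    for p_other, dp in list(links_from) + list(links_to):
        dx += p_other[0] - p_i[0] - dp[0]
        dy += p_other[1] - p_i[1] - dp[1]
    return (p_i[0] + alpha * dx, p_i[1] + alpha * dy)

# Experience i sits at x = 1.0 and odometry says its neighbour j is 1.0 ahead,
# but after loop closure j is placed at x = 2.2: a 0.2 discrepancy.
p_new = correct_position(
    p_i=(1.0, 0.0),
    alpha=0.5,
    links_from=[((2.2, 0.0), (1.0, 0.0))],
    links_to=[],
)
```

With α = 0.5, half of the 0.2 discrepancy is absorbed in one step, so the experience moves to about x = 1.1; repeating the step over all experiences spreads the loop closure error along the path.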

Experiment result
We evaluate different aspects of the proposed algorithm in the following sections. In Section 5.1, we compare the ORB algorithm with SIFT and examine the effectiveness of adopting RGB-D information. We then present the mapping results of our method in Section 5.2. In Section 5.3, we evaluate localization performance on noisy data. The experimental data [20] comes from an office in Singapore Research Institute, which contains many corridors, uniform furniture and moving people.

Feature extraction
In this section, we demonstrate the performance of ORB by comparing it with SIFT and Tian's method [20]. SIFT is one of the most popular feature extraction methods and a good fit for visual SLAM. Tian's visual processing method used RGB-D information to build a cognitive map successfully. To better illustrate the capabilities of ORB, SIFT and Tian's method, we select some events from the experiment.
We first show the main idea of Tian's visual processing method, which is based on RGB-D information, in Fig. 5. One-dimensional intensity profiles of neighboring environment scenes are first extracted from both the RGB and depth images. These profiles are then compared using the sum of absolute differences (SAD) to obtain a distance di, which is the sum of the absolute differences between the pixel values of the two intensity profiles. Finally, the distances di from the RGB and depth frames are combined with different weights to form the distance d between the current image and the visual templates recorded in the local view cells. Although this method is simple and efficient, it discards much information about the physical environment. In contrast, SIFT and ORB provide a better way of extracting features from an incoming RGB frame to describe the surrounding environment.
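The profile-based distance can be sketched in a few lines. This is a simplified illustration, not Tian's implementation: images are small grayscale arrays given as lists of rows, profiles are normalized to sum to one (the text does not say which normalization is used), and the weights `w_rgb` and `w_depth` are hypothetical placeholders for the unspecified weighting.

```python
def intensity_profile(image):
    """Column sums of a grayscale image, normalized to sum to one."""
    cols = [sum(row[c] for row in image) for c in range(len(image[0]))]
    total = sum(cols)
    return [v / total for v in cols] if total else cols

def sad(p, q):
    """Sum of absolute differences between two equal-length profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

def rgbd_distance(gray_a, depth_a, gray_b, depth_b, w_rgb=0.5, w_depth=0.5):
    """Weighted combination of the intensity and depth profile distances."""
    d_rgb = sad(intensity_profile(gray_a), intensity_profile(gray_b))
    d_depth = sad(intensity_profile(depth_a), intensity_profile(depth_b))
    return w_rgb * d_rgb + w_depth * d_depth

gray_a = [[1, 2], [3, 4]]    # column sums 4, 6 -> profile 0.4, 0.6
gray_b = [[4, 2], [2, 2]]    # column sums 6, 4 -> profile 0.6, 0.4
depth = [[5, 5], [5, 5]]     # identical depth in both frames

d_same = rgbd_distance(gray_a, depth, gray_a, depth)
d_diff = rgbd_distance(gray_a, depth, gray_b, depth)
```

Because each image is collapsed to a single row of column sums, scenes with different content but similar column statistics produce small distances, which is one way to see why such profiles struggle in offices with many similar scenes.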
In Fig. 6, image features extracted by SIFT and ORB are drawn as circles, and matched features are connected by lines. The first row shows the features extracted from neighboring frames by both methods; the two methods provide similar image matching capability. For the second row, we selected a loop-closure situation from the experiment. The result shows that both methods can verify that two images captured at different times represent the same location in the map. Compared with SIFT, ORB is better suited to extracting features from objects at middle and far distances. Furthermore, ORB is more efficient on a standard CPU, whereas SIFT requires a long computation time. In our experiment, the feature extraction process ran in a single thread on an Intel i7 3.6 GHz processor. The ORB implementation was configured to extract 500 key points at 8 pyramid levels with a scale factor of 1.2.
We compare the processing time of SIFT and ORB on our experimental data set, which contains 3 732 frames. The results are summarized in Table 1.
Our results show that ORB features are almost as reliable as SIFT features for detecting loop closures and matching neighboring frames. In addition, ORB features are not only invariant to image scale and rotation but also much faster to process, which speeds up the visual processing method.

Cognitive map
In this section, cognitive mapping results are shown to demonstrate the performance of the proposed method. As shown in Fig. 7, the mapping process includes three major steps: 1) The external RGB images and internal self-motion are used to generate and shift activity packets in the pose cell model. 2) Local view cells correct the accumulated errors caused by path integration when a loop closure is formed. 3) The cognitive map is updated continuously throughout the process. In the following, we describe these processes in terms of local view cells, pose cells and cognitive map construction.

Local view cell activity
This section compares the learned view templates versus frame for RatSLAM using ORB features and using one-dimensional intensity profiles. The local view cells learned 1 365 visual templates during the mapping process, as shown in Fig. 8 (a), which plots the learned visual templates against frames for the ORB-based visual processing module. The y-axis indicates the learned visual templates and the x-axis indicates the number of incoming frames captured by the camera. Intervals without new templates mean that either the robot is traveling through places it has already learned or the robot is stationary. Furthermore, the number of newly added local view cells gradually decreases after the robot has looped the environment several times. In contrast, Fig. 8 (b), which is generated from one-dimensional intensity profiles [19], contains many false positives, which lead directly to wrong loop closures and failures in spatial representation. Hence, it is desirable to use ORB to construct a more accurate map.

Pose cell activity
During the mapping process, there is an active packet in the pose cell model that represents the robot's location. Normally, it moves according to self-motion information from the mobile base. However, two typical events occur in the pose cell model during mapping, as shown in Fig. 9. In general, the number of cell units in the pose cell model is too small to represent every corresponding place in the physical environment. The relationship is one-to-many: one cell unit can have many physical places associated with it. Therefore, pose cells are arranged in a recurrent structure to keep the pose cell model continuous. As a result, part of an activity packet sometimes leaps to the opposite face of the pose cell model, as shown in Fig. 9 (a).
Local view cells play an important role in the loop closure process. When the robot enters a place it has visited before, the active local view cells transmit energy to the associated pose cells. Fig. 9 (b) and Fig. 9 (c) show the system encountering a familiar scene: local view cells inject energy into the pose cells, which creates two peaks. However, the pose cell model is capable of maintaining or correcting its beliefs: the pose cell dynamics filter out the ambiguity caused by similar visual cues and make the model converge to a single peak.

Cognitive map construction
The mapping system does not track visual features across several frames and does not take the geometric relationships of landmarks into consideration. It only connects and records the simple spatial relationships between places in the environment.
However, the map is corrected continuously throughout the process. When the robot reaches a familiar scene, the active local view cell injects energy into the associated pose cell, causing the robot to relocalize its current pose. This behavior prevents the robot from losing its way, as shown in Figs. 10 (a) and 10 (b). In Fig. 10 (a), the robot locates itself incorrectly when it returns to a previously explored area. This is because the self-motion information from the mobile base contains errors, which lead directly to "drift". After the local view cell stimulates the associated pose cell, however, RatSLAM detects the loop closure and resets its position, as shown in Fig. 10 (b).
For comparison, a map built from one-dimensional intensity profiles [19] is shown in Fig. 10 (d). This method fails to perform loop closure and cannot construct a correct map of the office environment. In contrast, with the help of ORB features, RatSLAM resolves the ambiguous scenes, closes loops correctly and eventually constructs a correct cognitive map, as shown in Fig. 10 (c).

Global localization
In the previous section, we studied the effectiveness of different visual processing methods for cognitive map building. This section focuses on how different visual processing methods affect localization results. To test localization ability, we randomly selected 1 244 frames from the raw database as the test data set. According to Table 1 in Section 5.1, the SIFT algorithm needs about 200 ms to extract features from each frame. This slow computation makes it hard to use in real-time robotics applications, so we compare only the localization results of ORB and Tian's method [20] in the following experiment. The experiment covers two cases:
1) Images are clean, without any noise.
2) Images are rotated by small angles to simulate noise.
For the first case, the comparison results are summarized in Table 2. The ORB method, which requires only the RGB information of the images, achieves a satisfactory localization accuracy of about 99.6%. This accuracy is almost the same as that of Tian's method, which uses both RGB and depth information. We conclude that ORB is almost as reliable as adopting RGB-D information. These results also demonstrate that ORB features are capable of distinguishing the various scenes in the environment.
For the second case, to study the robustness of ORB features, we first simulate noisy images by rotating the clean images by a few small angles, as shown in Fig. 11.
The localization performance under the simulated noise is given in Table 3. Table 3 shows that both ORB and Tian's method provide good localization performance, with accuracy above 95%, when images are rotated by up to 7 degrees. This implies that ORB features can deal with perceptual ambiguity in an office environment without depth information.
Table 2 (surviving row): Tian [20], RGB-D, 1 244, 4, 99.7

Conclusions
In this paper, we applied the ORB feature extraction approach to the RatSLAM system to build a cognitive map for a mobile robot. To evaluate the ORB method, we compared the extracted features with those of several other visual processing approaches, namely SIFT and existing feature extraction methods [19,20], for both RGB and RGB-D signals. The feature extraction section showed that ORB is not only suitable for matching neighboring frames with in-plane camera motion but also satisfies real-time visual processing requirements. The global localization section showed that the proposed brain-inspired SLAM system can localize itself in the map with the help of ORB features. Furthermore, the cognitive mapping results verified that ORB can significantly enhance the robustness of the SLAM system in an indoor environment. Specifically, the proposed SLAM system greatly reduces false positives, repeatedly closes loops correctly even under accumulated odometry error, and eventually constructs a cognitive map.

Open Access
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Fig. 4 Template matching process for local view cells. ORB features are extracted from environment scenes as a visual template, which is compared against all visual templates associated with the local view cells Lvi. When the current visual template matches a prior template, the associated local view cell Lvi fires and injects energy into the cognitive map model. Otherwise, a new local view cell Lvj is added to Lv.

Fig. 5 A pair of neighboring RGB-D frames. The top row shows RGB information and the bottom row shows depth information; both are captured simultaneously. Intensity profiles of the neighboring environment scenes are then extracted and compared using the sum of absolute differences to obtain a distance di. The distances di from the RGB and depth images are weighted and combined into d, which is used to distinguish different scenes.

Fig. 6 Examples of features matched using SIFT (pair on the left) and ORB descriptors (pair on the right). The first row shows neighboring frames captured by the camera. The second row shows a loop closure event in the experiment. Both SIFT and ORB provide sufficient corresponding matched points.

Fig. 8 Activity of local view cells during the mapping process. Graphs (a) and (b) show view template versus frame generated from ORB features and intensity profiles, respectively. Comparing (a) with (b) shows that ORB helps RatSLAM significantly reduce the number of false positives.

Fig. 9 Activity packets are shown as the orange zone; a lighter color means a higher activation level. In (a), one active packet shifts to the opposite face, which keeps the packet continuous. In (b), when the robot recognizes a sufficiently similar environment scene, local view cells inject energy into the pose cells and two active packets emerge. In (c), only one packet survives after both activity packets undergo the internal CAN dynamics.

Fig. 11 A snapshot of the environment. (a) Original image selected from the experiment. (b), (c) and (d) The same scene rotated clockwise by 3, 5 and 7 degrees, respectively. As the rotation angle increases, it becomes more difficult for the system to locate itself within the map.

Table 1 Comparison of visual processing time

Table 2 Comparison of localization result

Table 3 Localization result (with rotation noise)