1 Introduction

During the past few years, vision sensors have been used extensively in the field of map building and localization with mobile robots (Hu et al. 2020; Zhong et al. 2018). In particular, the ability to localize within the map is of paramount importance in order to develop autonomous robots that can navigate in real operating conditions. The interest in using vision sensors to capture information from the environment remains high. Cameras capture a large amount of information from the environment at a relatively low cost, and they can be used in both indoor and outdoor areas. Additionally, the images permit carrying out other highly specialized tasks, such as object recognition and people detection.

Among the available configurations to capture visual information, the use of omnidirectional vision sensors in mobile robotics has become common. Omnidirectional cameras obtain images that cover a field of view of 360° around the robot (Junior et al. 2016). As a result, they are commonly used to address navigation tasks (Rituerto et al. 2010).

The large amount of information provided by cameras requires robust techniques to extract and describe the relevant visual information. Different paradigms have been considered to extract this information. A first group of techniques concentrates on detecting, describing and tracking landmarks or local features along the scenes (Cao et al. 2020; Lin et al. 2020). Different local features have been used in mapping and localization tasks, including SIFT, SURF and ORB descriptors (Rublee et al. 2011). A global description of each image can then be obtained, for example, by means of the Bag of Words model (Mur-Artal and Tardós 2015). A second group of techniques works with each scene as a whole and builds a unique descriptor per image that contains information on its global appearance (Korrapati and Mezouar 2017; Khaliq et al. 2019). Finally, hardware developments have led many authors to use Artificial Intelligence (AI) techniques to extract relevant information from images. Specifically, Convolutional Neural Networks (CNNs) have been proposed to address different computer vision and robotics tasks. For example, Xu et al. (2019) and Leyva-Vallina et al. (2019) proposed global-appearance descriptors based on a CNN to obtain the most probable pose of the robot.

In general terms, holistic description methods lead to maps in which a set of robot poses and their associated descriptors are stored. In this way, each pose of the robot is represented by a holistic descriptor and this representation leads to straightforward localization algorithms, based on the pairwise comparison between descriptors.

In this manuscript we assess the use of Siamese Neural Networks in the context of image description and robot localization. Siamese Neural Networks evaluate two images at the same time and provide a similarity measurement at the output. Therefore, they have the potential to address visual place recognition and to estimate the position of a mobile robot. In the present paper, we evaluate this potential. The main contributions of this paper can be summarized as follows.

  1. We explore the capability of Siamese Neural Networks for modeling indoor environments, using panoramic images as the unique source of information.

  2. We train and evaluate Siamese Neural Networks with the purpose of detecting whether two images have been captured in the same or in different rooms.

  3. We train Siamese Neural Networks capable of estimating the robot position, posed as a global image retrieval problem.

  4. We conduct an exhaustive study on the influence of the Siamese Neural Networks’ architecture and the most relevant parameters. Moreover, we analyze the robustness against some common visual phenomena that may occur in real operating conditions, such as changes of the lighting conditions or image blur.

The remainder of this paper is structured as follows. First, in Sect. 2 we present a review of the state of the art on visual localization and mapping using Artificial Intelligence techniques. Second, in Sect. 3 we introduce Siamese Neural Networks for both room discrimination and global localization. After that, Sect. 4 presents the CNN architectures, the dataset and the proposed data augmentation; in this section we also describe the proposed method for room discrimination and global localization by means of Siamese Neural Networks. Then, Sect. 5 describes the experiments carried out to test and validate the proposed method. Finally, the conclusions and future work are outlined in Sect. 6.

2 State of the art

As stated before, Siamese Neural Networks are able to generate a similarity function from pairs of input data. They can be regarded as a superstructure that includes two Neural Networks. These architectures accept two different inputs and offer a single output. The underlying networks share the same weights, and different functions can be used to form the single output. They were first proposed in 1993 in order to distinguish correct signatures from forgeries (Bromley et al. 1993). Since then, these architectures have been proposed in different areas of knowledge. For example, Thiolliere et al. (2015) proposed a Siamese Neural Network for audio and speech signal processing, Zheng et al. (2019) used this architecture for the comparison of DNA sequences and Jeon et al. (2019) used it for drug discovery purposes. Furthermore, Parajuli et al. (2017) developed a Siamese Neural Network to track cardiac motion and Sandouk and Chen (2017) proposed a Siamese architecture in order to recognize music tags. Recently, Suljagic et al. (2022) used this kind of architecture for multi-object tracking (MOT) and person re-identification.

During the past few years, AI in general and CNNs in particular have been used in the field of mobile robotics for a variety of purposes. For instance, for mapping (Sinha et al. 2018; Moolan-Feroze et al. 2019), localization (Weinzaepfel et al. 2019; Cattaneo et al. 2019), navigation (Zhao et al. 2018; Ma et al. 2019) and simultaneous localization and mapping (Lu and Lu 2019; Liu et al. 2019). A complete state-of-the-art review on mobile robotics tasks based on the use of AI can be found in (Cebollada et al. 2020). Other applications of AI in the context of mobile robotics include: self-driving navigation (Polvara et al. 2018; Organisciak et al. 2020), face detection and recognition (Wang et al. 2017; Hu et al. 2021), object recognition and categorization (Zaki et al. 2019; Feng et al. 2020) and mapping and localization (Holliday and Dudek 2018; Ruan et al. 2019).

Convolutional Neural Networks (CNNs) are the most popular techniques among AI tools. Currently, they are used in many mapping and localization tasks due to their successful performance in many practical applications. They are designed to receive images as input and their structures are specially created to obtain descriptors that synthesize the information in them (Chollet et al. 2018). Therefore, they can be used to describe the global appearance of an image. In this sense, Cebollada et al. (2019) proposed holistic descriptors obtained with a CNN to perform localization within topological models, studying their robustness against illumination variations. Also, Xu et al. (2019) and Leyva-Vallina et al. (2019) proposed these techniques to obtain the most probable robot position. Additionally, Ballesta et al. (2021) studied localization tasks using CNNs and regression layers as global appearance descriptors. Recently, Rostkowska and Skrzypczyński (2023) employed the EfficientNet model (Tan and Le 2019) to embed an omnidirectional image into a single descriptor, followed by a K-Nearest Neighbours (KNN) algorithm to robustly predict the topological position in a given database (map). In this regard, that work uses the Facebook AI Similarity Search (FAISS) library (Johnson et al. 2019) to efficiently perform the nearest neighbour search by means of a KD-Tree.

Some well known architectures have been used as basic structures to develop new modified networks for robotic navigation purposes. AlexNet (Krizhevsky et al. 2012), VGG16 (Simonyan and Zisserman 2014), GoogleNet (Szegedy et al. 2015) or NetVLAD (Arandjelovic et al. 2016) are some of them.

The Convolutional Neural Networks presented above can be used to form a Siamese Neural Network. In the field of robotics, they have rarely been used; some studies that propose this structure in this field are mentioned below. For example, Utkin et al. (2017) used a Siamese Neural Network to support the security control of a robot by detecting anomalies in its behaviour, and Zeng et al. (2018) presented a robotic pick-and-place system capable of identifying and grasping both known and novel objects in cluttered environments using a Siamese Neural Network. Moreover, Li and Zhang (2019) used the VGG16 network to form a Siamese structure for object detection and tracking. Additionally, Zhang and Peng (2019) presented a study in which Siamese Networks are followed by fully connected layers or Region Proposal Network structures in the context of real-time visual tracking.

Regarding robot localization tasks, Leyva-Vallina et al. have proposed the use of Siamese Neural Networks to address the place recognition problem in garden environments (Leyva-Vallina et al. 2019, 2021). Moreover, this architecture has been proposed for localization using LiDAR scans (Yin et al. 2018; Chen et al. 2022).

In the present paper, we address the localization of a mobile robot using panoramic images, and we study in detail different architectures and training configurations of Siamese Neural Networks. As an initial approach, we train and test the capability of the network to distinguish between images captured in the same room and in different rooms. In addition, we also tackle the global localization problem using Siamese Neural Networks.

3 Visual localization using Siamese Neural Networks

Siamese Neural Networks can be described as a superstructure that includes, at least, two Neural Networks beneath. Weights are shared between the networks and a single output is generated by combining the outputs of both networks. Figure 1 shows a general representation of a Siamese Neural Network architecture. In the present work, we use Convolutional Neural Networks to form the two branches of the Siamese Neural Network. The output of each CNN is a descriptor which is used to characterize each input image. The dissimilarity of the input images is computed by measuring the distance between these descriptors. In this way, Siamese Neural Networks can be trained to generate similar descriptors when the training images belong to the same category. This makes Siamese Neural Networks particularly suitable for image retrieval tasks. Additionally, it is worth noting that the outputs, training and performance of the network depend directly on:

  • The architectures used in subnetworks W1 and W2 to extract the main features of the images.

  • The conversion of the feature maps from the convolutional layers to a descriptor vector.

  • The dimension of the output descriptors that embed the pair of input images.

  • The training carried out with the available images. In particular, the labelling and the ratio of images of each category.

Fig. 1 Representation of the architecture of a general Siamese Neural Network

In this manuscript, we analyze the influence of these items on the visual localization of the robot. In this sense, we assume that a visual map of the environment is initially available. To obtain this map, the robot has moved throughout the area, capturing omnidirectional images along the trajectory. Firstly, the images are transformed to a panoramic format (with size 128×512 pixels in the present work), resulting in the set \(\{I_1,I_2,\ldots ,I_N\}\). These images are captured from N points of view, whose poses are known and stored, \(\vec {P}_i=(x_i,y_i,\theta _i),i=1,\ldots ,N\). Additionally, the room where each image has been captured is also known, so a set of labels \(r_i,i=1,\ldots ,N\) is available. During the localization, each image will be embedded into a single descriptor using the proposed architecture, yielding \(\{\vec {f}_1,\vec {f}_2,\ldots ,\vec {f}_N\}\). The trajectory followed by the robot includes different rooms with different visual information. In this work, these rooms include a corridor, some offices, a library and a bathroom.

Taking these facts into account, the initial map is composed of the set of images, their poses and the rooms in which the images were captured, \(\{(I_1,\vec {P}_1,r_1),(I_2,\vec {P}_2,r_2),\ldots ,(I_N,\vec {P}_N,r_N)\}\). Using this information, some Siamese Neural Networks are trained to address localization.
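To make the structure of this map concrete, the following Python sketch shows one possible representation of these tuples. The class and function names are illustrative assumptions and are not part of the original implementation.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class MapEntry:
    """One node of the visual map: panoramic image, capture pose and room label."""
    image: np.ndarray               # panoramic image, e.g. 128x512x3
    pose: np.ndarray                # capture pose (x, y, theta)
    room: str                       # room label r_i
    descriptor: np.ndarray = None   # global descriptor f_i, computed later by the network

def build_map(images: List[np.ndarray], poses: List[np.ndarray],
              rooms: List[str]) -> List[MapEntry]:
    """Assemble the initial map {(I_i, P_i, r_i)} from the mapping trajectory."""
    return [MapEntry(img, np.asarray(p), r) for img, p, r in zip(images, poses, rooms)]
```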

3.1 Room discrimination

In this subsection, an initial task related to localization is evaluated, to study whether a Siamese Neural Network is able to distinguish between images captured from the same room and from different rooms. For this purpose, the model will be trained and tested with pairs of images randomly selected either from the same room or from different rooms.

3.2 Global localization

In this study we consider that a map of the environment is available, as described before. The absolute localization problem is solved by comparing the test image directly with all the images in the map. This comparison is performed using the descriptors \(\vec {f}_i\) associated with each image in the map. The pose of the robot is estimated as the pose associated with the most similar descriptor in the map. The problem is approached with purely visual information and assuming that no information about the previous pose of the robot is available.

4 Architecture and training of the deep learning tools

The structure of a classical CNN used for classification tasks can be split into two different stages (Cebollada et al. 2019): the feature learning stage and the classification stage. Features are extracted using several convolutional layers, whereas the classification task is carried out by fully connected layers and a final Softmax function. In our approach, the classification stage is replaced by a feature aggregation phase. In this sense, the feature extraction phase outputs multiple feature maps, which are flattened to a vector and dimensionally reduced by fully connected layers. This phase permits generating a single description vector per input image. As a result, the model provides two vectors \(\vec {f}_0\) and \(\vec {f}_1\) (one per input image). These descriptors are compared using the Euclidean distance in the comparison phase, \(d(\vec {f}_0,\vec {f}_1)=\Vert \vec {f}_0-\vec {f}_1\Vert _2\). This architecture is shown in Fig. 2. Therefore, during training, the weights of the networks are updated in order to obtain the optimal global descriptors. After the comparison, the distance between the descriptors and the similarity label (1: dissimilar, 0: similar) are used as inputs to the loss function. In our case, the loss function used is the Contrastive Loss, defined in Eq. (1).

Fig. 2 Detailed representation of a Siamese Neural Network with AlexNet in the feature extraction and feature aggregation phases

$$\begin{aligned} L(\vec {f}_0,\vec {f}_1)=\frac{1}{2}(1-y)\,d(\vec {f}_0,\vec {f}_1)^2+ \frac{1}{2}\,y\,\max (\alpha -d(\vec {f}_0,\vec {f}_1),0)^2 \end{aligned}$$
(1)

where \(y\) is the similarity label and \(\alpha > 0\) is a margin. The margin defines a radius around the descriptor, so that dissimilar pairs of images contribute to the loss function only if their distance is within this radius (Hadsell et al. 2006).
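A minimal PyTorch sketch of this loss function, written directly from Eq. (1), could look as follows. The class and variable names are ours and the default margin value is an assumption, not the value used in the original implementation.

```python
import torch
import torch.nn.functional as F

class ContrastiveLoss(torch.nn.Module):
    """Contrastive loss of Eq. (1): y = 0 for similar pairs, y = 1 for dissimilar pairs."""
    def __init__(self, margin: float = 2.0):   # margin alpha (assumed value)
        super().__init__()
        self.margin = margin

    def forward(self, f0: torch.Tensor, f1: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        d = F.pairwise_distance(f0, f1)                     # Euclidean distance between descriptors
        loss_similar = (1 - y) * 0.5 * d.pow(2)             # pulls similar pairs together
        loss_dissimilar = y * 0.5 * torch.clamp(self.margin - d, min=0.0).pow(2)  # pushes dissimilar pairs apart
        return (loss_similar + loss_dissimilar).mean()
```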

4.1 Parameters and networks

In this manuscript we compare different networks in the feature learning stage. As inputs to the feature aggregation stage, we consider the representation computed in the last convolutional layer of AlexNet (Krizhevsky et al. 2012), DenseNet (He et al. 2016), VGG11, VGG13, VGG16 and VGG19 (Simonyan and Zisserman 2014). AlexNet is a pioneering CNN architecture known for its success in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012. Visual Geometry Group (VGG) networks further contributed to the advancement of image classification, outperforming benchmarks on a variety of tasks and datasets beyond ImageNet (Bayraktar et al. 2019, 2020). The main difference between the VGG networks is the number of convolutional layers: 11, 13, 16 and 19 layers respectively. The feature extraction layers of these CNNs are presented in Table 1. Additionally, two simple networks created with three conv2d layers are also evaluated (Table 2). The ReLU activation layers are not shown for brevity, but they have been used after each conv2d layer. The feature extraction layers are shown in black in Tables 1 and 2. The different feature learning structures are evaluated in Sect. 5.

Table 1 Configuration of the feature extraction neural networks. (Color table online)
Table 2 Simple convolutional neural networks without pretraining. (Color table online)

In all the cases, the feature extraction stage outputs a high-dimensional vector obtained by flattening the feature maps from the last maxpool or averagepool layer. Therefore, if the descriptor were extracted directly from this layer, comparing descriptors through nearest neighbour search would be computationally expensive. To alleviate this problem, we use fully connected layers to compress the flattened vector into a compact global descriptor, which can be used for efficient retrieval, as demonstrated in Schaupp et al. (2019). These layers are shown in blue in Tables 1 and 2. As a baseline, three fully connected layers are used, and different versions with different numbers of neurons are considered. The different layers used during the evaluation are presented in Table 3.

Table 3 Configuration of the feature aggregation phase in our approach
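The following PyTorch sketch illustrates the overall structure described above: a convolutional backbone (a pretrained VGG16 is used here by way of example) followed by three fully connected layers that compress the flattened feature maps into a compact descriptor. The two branches of the Siamese network share weights because the same module is simply evaluated twice. The pooling layer, layer sizes and names are illustrative assumptions, not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
from torchvision import models

class SiameseBranch(nn.Module):
    """Feature extraction (VGG16 conv layers) + feature aggregation (three FC layers)."""
    def __init__(self, descriptor_size: int = 5):
        super().__init__()
        vgg = models.vgg16(weights="IMAGENET1K_V1")
        self.features = vgg.features                 # convolutional feature extraction stage
        self.pool = nn.AdaptiveAvgPool2d((4, 16))     # fixed-size feature map for 128x512 inputs (assumption)
        self.aggregation = nn.Sequential(             # feature aggregation: flatten + 3 FC layers
            nn.Flatten(),
            nn.Linear(512 * 4 * 16, 500), nn.ReLU(),
            nn.Linear(500, 500), nn.ReLU(),
            nn.Linear(500, descriptor_size),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.aggregation(self.pool(self.features(x)))

def siamese_distance(branch: SiameseBranch, img0: torch.Tensor, img1: torch.Tensor) -> torch.Tensor:
    """Both images go through the *same* branch (shared weights); the output is the L2 distance."""
    f0, f1 = branch(img0), branch(img1)
    return torch.norm(f0 - f1, dim=1)
```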

Other parameters are also tuned during the training phase with the aim of obtaining the best Siamese Neural Network for our application. The hyperparameters considered during the evaluation are the following: the batch size (number of samples processed before the model is updated), the number of epochs (number of complete passes through the training dataset) and the percentage of images (percentage of training pairs of images from the same or from different rooms, so that the network can adequately learn similarities and dissimilarities between rooms). In the experiments, the learning rate is kept constant at 0.001 (rate of change of the model in response to the estimated error) and the momentum is 0.9 (contribution of the parameter update step of the previous iteration to the current iteration).
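By way of example, and assuming the SiameseBranch and ContrastiveLoss classes sketched earlier in this section, these fixed optimization settings translate into a standard SGD configuration; the remaining values shown are only some of those explored in the experiments.

```python
import torch

# 'SiameseBranch' and 'ContrastiveLoss' are the sketches defined above (assumed names).
branch = SiameseBranch(descriptor_size=5)
criterion = ContrastiveLoss(margin=2.0)      # margin value is an assumption
optimizer = torch.optim.SGD(branch.parameters(), lr=0.001, momentum=0.9)

# Hyperparameters varied in the experiments (example values); lr and momentum stay fixed.
batch_size = 16
num_epochs = 5
same_room_ratio = 0.5   # fraction of training pairs captured in the same room
```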

4.2 Datasets and data augmentation

4.2.1 Training and test datasets

The images used in the experiments are obtained from an indoor dataset (Pronobis and Caputo 2009). This database was captured by an omnidirectional vision sensor mounted on a mobile robot which followed different trajectories that visited 9 different rooms. A variety of lighting conditions was considered to capture the sets of images.

Table 4 shows the number of images per room for each of the datasets used in this research. Two training sets are considered: training set 1 consists of 8486 images captured under cloudy, sunny and night illumination conditions (COLD-Freiburg Part A Path 2 Cloudy 3, Freiburg Part A Path 2 Night 1, Freiburg Part A Path 2 Sunny 3). Training set 2 has been obtained by applying data augmentation to the cloudy sequence of training set 1, thus generating 977,856 images. With respect to the test sets, four different sets are considered: test set 1 consists of 2595 images under the cloudy lighting condition (COLD-Freiburg Part A Path 2 Cloudy 2), test set 2 contains 2707 images captured under the night lighting condition (COLD-Freiburg Part A Path 2 Night 2), test set 3 consists of 2114 images under the sunny lighting condition (COLD-Freiburg Part A Path 2 Sunny 2) and test set 4 is composed of all the images in the previous test sets. It should be noted that the images in the test sets are, in all cases, different from the images that constitute the training sets. Finally, the visual map has been obtained by sampling the path of test set 1 (cloudy lighting condition), resulting in a total of 556 images.

In this way, the training sets will be used to train the Siamese Neural Networks, and the test sets will be used to evaluate the performance of the networks under the three lighting conditions. The visual map is the map available to the robot to carry out the localization, so it will be used in the testing phase of the global localization task.

Table 4 Summary of the training and test datasets

4.2.2 Data augmentation

Additionally, a data augmentation technique is proposed as a method to improve the performance of the network. It increases the number of images in the training dataset. Having a larger number of training images reduces the overfitting of the model and boosts its robustness against real operating conditions. Cabrera et al. (2021) and Sakkos et al. (2019) demonstrated the use of data augmentation in CNNs to improve their effectiveness under changing lighting conditions.

Our proposed data augmentation focuses mainly on such lighting conditions and concentrates on editing local regions by simulating lights, reflections and shadow effects caused by light sources from different angles. Moreover, global illumination changes are also taken into account. Other effects that are not related to illumination but that can appear when images are captured in real operating conditions are also included. A compact code sketch that illustrates several of these effects is provided at the end of this list.

Local effects::

Light sources that fall on a specific area or on the surface of an object are reproduced. We call these local illumination changes, since only a small patch of the image is affected. The shape of different light sources can vary meaningfully: circular shapes from light bulbs or square and trapezoidal shapes from reflections or windows are common. We edit the intensity of different regions following these shapes to simulate the light source; the pixel intensity is increased to reproduce a brighter area or decreased to simulate a shadow effect. In order to replicate a realistic fading effect, the intensity of the brightening/darkening is gradually decreased from the center to the edge, as an attenuation of the light. The size and position of the shapes are selected randomly to reproduce the effect in different ways, and so is the maximum intensity value, in order to consider different intensities. In our experiments these shapes are built with sizes between 15 and 40 pixels, different intensities are applied and the patch is attenuated from intensity values of ±160 or 100 down to 5. The effects and shapes are shown in Fig. 3.

Fig. 3 Individual local effects for data augmentation based on illumination

Global illumination::

Global illumination variations can also occur. To model such illumination changes, we alter pixels across the whole image rather than in a small region. A constant value \(c\) is added to all the pixels to model a global brightening effect, or subtracted to simulate a global darkening. The value of \(c\) varies from 35 to 75 in this work. Figure 4b and c show this effect.

Sharpness/Blurring::

Sharper borders among objects contribute to a better separation between them and between foreground and background. In contrast, blurring effects are caused by low illumination or camera motion, which are common in mobile robotics. Both effects are incorporated in the data augmentation and can be observed in Fig. 4d and e. Both are achieved by a convolution operation using the following masks.

Fig. 4 Global effects for data augmentation

Sharpness effect:

\(m_{sh}=\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}\)

Blurring effect:

\(m_{bl}=\frac{1}{25}\begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix}\)

Contrast variation::

The contrast of the image plays an important role in highlighting different objects in the scene. Low-contrast images usually look softer, with fewer shadows and reflections. This effect is included in the data augmentation to improve the robustness of the framework. The contrast is modified according to the following equation:

$$\begin{aligned}I_s=64+c*(I-64)\end{aligned}$$

where \(I_s\) is the resulting image, \(I\) the original image and \(c\) the contrast factor. For \(c>1\) the contrast increases, whereas for \(c<1\) it decreases. Additionally, histogram equalization is also added to the data augmentation set; it evenly distributes the histogram values, which provides an additional contrast effect. Figure 4f shows this effect.

Saturation changes::

The colour saturation of the image refers to the intensity of the colour. The lower the saturation, the less colourful the image; it can even resemble a grey-scale image if the saturation is very low. In contrast, more vivid colours are obtained when the colour saturation is high. This effect can simulate situations in which the illumination changes significantly. The colour saturation can be edited by converting the RGB image to HSV; after that, the saturation channel can be directly modified by multiplying it by a constant factor \(c\). If the saturation channel is multiplied by \(c>1\) the colours become more saturated, whereas \(c<1\) decreases the colour saturation. The effect can be seen in Fig. 4g.

Rotation::

The original image covers 360° around the robot. For that reason, the image can be rotated without losing any information. This effect simulates the situation in which the robot is at the same position but with a different orientation. Moreover, having a training dataset containing this type of effect is expected to provide the Neural Network with rotation invariance. Figure 4h shows a rotation effect of 115°. Random rotations between 10 and 350° are applied to the training images.

Combined changes::

Additionally, some effects are combined to obtain a larger augmented dataset, although not all the effects are combined together. Global illumination and a single local effect are combined in all the possible variations, e.g. global darkening is combined with a brightening circular effect, global brightening is combined with a brightening trapezoidal effect, etc. Additionally, the local effects are also combined with each other: the circular effect is combined with the square effect, the trapezoidal effect or another circular effect, and the combinations can be brightening+brightening, brightening+darkening and darkening+darkening; the circular effect is also combined with two other circular effects, obtaining an image with three light bulb effects. Finally, the rotation effect is individually combined with all the effects and combinations described above.
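As announced above, the following NumPy/OpenCV sketch reproduces several of the described effects: global brightness, contrast, colour saturation, sharpening/blurring with the masks \(m_{sh}\) and \(m_{bl}\), and rotation of the 360° panorama as a circular column shift. The function names are ours, the local light-spot effects are omitted for brevity, and the exact implementation used in the paper may differ.

```python
import cv2
import numpy as np

def global_brightness(img: np.ndarray, c: int) -> np.ndarray:
    """Add (or subtract, with negative c) a constant in [35, 75] to every pixel."""
    return np.clip(img.astype(np.int16) + c, 0, 255).astype(np.uint8)

def contrast(img: np.ndarray, c: float) -> np.ndarray:
    """I_s = 64 + c * (I - 64); c > 1 increases contrast, c < 1 decreases it."""
    return np.clip(64 + c * (img.astype(np.float32) - 64), 0, 255).astype(np.uint8)

def saturation(img_bgr: np.ndarray, c: float) -> np.ndarray:
    """Scale the S channel in HSV space by a factor c."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] = np.clip(hsv[..., 1] * c, 0, 255)
    return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)

def sharpen(img: np.ndarray) -> np.ndarray:
    m_sh = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=np.float32)
    return cv2.filter2D(img, -1, m_sh)

def blur(img: np.ndarray) -> np.ndarray:
    m_bl = np.ones((5, 5), dtype=np.float32) / 25.0
    return cv2.filter2D(img, -1, m_bl)

def rotate_panoramic(img: np.ndarray, degrees: float) -> np.ndarray:
    """A rotation of the robot corresponds to a circular column shift of the panorama."""
    shift = int(round(img.shape[1] * degrees / 360.0))
    return np.roll(img, shift, axis=1)
```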

4.3 Training and testing the Siamese Neural Network

As presented in Sect. 4.1, different CNN architectures can be used as the base of a Siamese Neural Network. Initially, we start from pretrained networks with known weights and biases. Then, we retrain the network to fit it to our application. This transfer learning technique is well known and has previously been used in mobile robotics (Cabrera et al. 2021).

Section 4.3.1 will address an initial task which consists in training and evaluating the capability of a Siamese Neural Network to identify whether two images were captured from the same or from different rooms. Then, Sect. 4.3.2 will detail the characteristics of the training and test processes to address the absolute localization problem with Siamese architectures. Emphasis will be placed on the labelling required to perform the desired task.

4.3.1 Room discrimination

The main goal of this task is to evaluate whether a Siamese Neural Network is capable of determining if two images belong to the same or different room. It is an important capability to perform localization tasks.

The training phase is performed by feeding the network with pairs of images. These pairs are labelled with 0 if both images have been captured from the same room and with 1 otherwise. The ratio of same-room to different-room pairs is varied in the training phase to study its influence.

During the test phase, pairs of images are fed into the network. At the output, the network labels them with a number between 0 and 1; if the result is below 0.5, we interpret that the images have been captured from the same room; otherwise, they are considered to belong to different rooms. The images used to test the network are different from the training images: they are captured in the same building but at different times and under a variety of lighting conditions. Also, the trajectory followed by the robot to capture the test images is similar to the one used to capture the training images, but the images are captured from different robot poses (Fig. 5).
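A schematic sketch of this protocol (pair sampling with a configurable same-room ratio and a 0.5 decision threshold) is shown below. It reuses the MapEntry structure sketched in Sect. 3, the helper names are ours, and the network output is assumed here to be already normalized to the interval [0, 1].

```python
import random
from typing import List, Tuple

def sample_pair(entries: List["MapEntry"], same_room_ratio: float) -> Tuple[int, int, float]:
    """Pick a pair of map indices; label 0 if both images come from the same room, 1 otherwise."""
    i = random.randrange(len(entries))
    if random.random() < same_room_ratio:
        candidates = [j for j, e in enumerate(entries) if e.room == entries[i].room and j != i]
    else:
        candidates = [j for j, e in enumerate(entries) if e.room != entries[i].room]
    j = random.choice(candidates)
    label = 0.0 if entries[i].room == entries[j].room else 1.0
    return i, j, label

def same_room_prediction(score: float) -> bool:
    """Decision rule used at test time: outputs below 0.5 are interpreted as 'same room'."""
    return score < 0.5
```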

Fig. 5 Example of different trajectories of the robot

4.3.2 Global localization

The global localization problem considers the estimation of the robot pose within the whole floor of the building. For this purpose, a Siamese Neural Network is trained. The training is carried out with image pairs labelled according to the following equation:

$$\begin{aligned} Label(I_i, I_j)= \left\{ \begin{array}{ll} \dfrac{\Vert \vec {p}_i-\vec {p}_j\Vert _2}{K_b} & \text {if } I_i \text { and } I_j \text { belong to the same room} \\ 1 & \text {otherwise} \end{array} \right. \end{aligned}$$
(2)

where \(I_i\) and \(I_j\) are two images and \(\vec {p}_i\) and \(\vec {p}_j\) are their corresponding positions (coordinates of the capture points). The label thus constitutes a normalized Euclidean distance between the capture points, where \(K_b\) is the maximum distance between two capture points in the building. Table 5 shows different examples according to Fig. 5.
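Translated into code, the labelling of Eq. (2) is straightforward. The following sketch assumes that \(K_b\) has been computed beforehand as the maximum distance between capture points of the map; the function name and argument layout are ours.

```python
import numpy as np

def pair_label(p_i: np.ndarray, p_j: np.ndarray, room_i: str, room_j: str, k_b: float) -> float:
    """Label of Eq. (2): normalized distance between the 2D capture positions (x, y)
    for same-room pairs, 1.0 for pairs captured in different rooms."""
    if room_i == room_j:
        return float(np.linalg.norm(p_i - p_j) / k_b)
    return 1.0
```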

Table 5 Example pairs and their label values for the absolute localization task

Once the network has been trained, the test is performed by using the map, which is composed of the set of image descriptors and their positions \(\{(\vec {f}_1,\vec {p}_1),(\vec {f}_2,\vec {p}_2),\ldots ,(\vec {f}_N,\vec {p}_N)\}\). Each descriptor has been calculated with the trained Siamese Neural Network. The absolute localization is performed as a pairwise comparison between image descriptors. Given a test image \(I_t\), the Siamese Neural Network outputs its corresponding descriptor \(\vec {f}_t\). Finally, the position of the robot is estimated by selecting the pose associated with the descriptor in the map that minimizes the distance \(\Vert \vec {f}_t-\vec {f}_i\Vert _2\), with \(i = 1,\ldots , N\).
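A minimal sketch of this retrieval step, as a brute-force nearest neighbour search over the map descriptors, could be the following; the function name and array layout are assumptions.

```python
import numpy as np

def localize(f_t: np.ndarray, map_descriptors: np.ndarray, map_poses: np.ndarray) -> np.ndarray:
    """Estimate the robot pose by retrieving the map descriptor closest to f_t.

    map_descriptors: array of shape (N, D) with the descriptors f_1 ... f_N.
    map_poses:       array of shape (N, 3) with the poses (x, y, theta).
    """
    distances = np.linalg.norm(map_descriptors - f_t, axis=1)   # ||f_t - f_i||_2 for every map node
    return map_poses[int(np.argmin(distances))]
```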

5 Experiments

The set of experiments is designed to test the performance of the Siamese Neural Network as a global descriptor generator to tackle the room discrimination and global localization tasks, as explained in Sects. 4.3.1 and 4.3.2.

5.1 Room discrimination

In this subsection we assess the ability of the network to predict whether two images are taken from the same room. The effectiveness of the Siamese Neural Network is calculated by comparing pairs of images and checking the predicted label against the ground truth. The results are expressed as a percentage of accuracy. Several experiments have been conducted while varying different parameters: the feature extraction architecture, the feature aggregation layers and the percentage of similar/dissimilar pairs. As common parameters, we train the network using 8486 pairs of images per epoch from training dataset 1 and we use the Stochastic Gradient Descent (SGD) optimizer, with a learning rate of 0.001 and a momentum of 0.9. Moreover, we test the network with 7000 pairs of images extracted from test dataset 4.

5.1.1 Influence of the architecture on the feature extraction process

In this subsection we compare different models in the feature extraction stage of a Siamese Neural Network. The models used can be observed in Table 1. The training has been performed using a batch size of 256 and 5 epochs. During training, the dataloader presents 50% of the pairs from the same room and 50% from different rooms. In these experiments, the feature aggregation is performed with 3 fully connected layers composed of 500, 500 and 5 neurons respectively.

Results are presented in Table 6 in terms of global accuracy. Additionally, the test accuracy for the same-room and different-room predictions is also presented. The table shows that the best networks are VGG13 and VGG16. They obtain the best accuracy when predicting pairs of images from the same room (99.44% and 99.47% respectively). In addition, VGG13 and VGG16 present the best accuracy when predicting whether two images are taken from different rooms (79.86% and 78.91%). Moreover, the ‘Simple 1’ and ‘Simple 2’ networks obtain remarkably good results using only three convolutional layers. Finally, in general terms, it can be observed that all the architectures perform better at predicting whether two images belong to the same room. For this reason, we consider below the possibility of varying the percentage of images of each category in the training phase.

Table 6 Accuracy using different feature extraction neural networks. (Color table online)

5.1.2 Influence of the training parameters

In the light of the previous results, different training parameters are evaluated next. As explained in the previous subsection, the ratio of training pairs in each category is expected to have a substantial influence upon the results. In consequence, we propose to change the percentage of pairs of images in the training phase. The percentage of pairs taken from the same and from different rooms varies from 5% to 40% and from 95% to 60%, respectively. For brevity, we only show the results obtained with the VGG13, VGG16 and AlexNet networks. The rest of the training parameters are set as before, using a batch size of 256 and a feature aggregation phase with three fully connected layers composed of 500, 500 and 5 neurons. The results are presented in Tables 7, 8 and 9. They show a correlation between the percentage of pairs of each category and its respective accuracy, i.e., when the percentage of pairs of images from the same room increases, its associated accuracy also does, and a similar phenomenon occurs with the different-room category.

Table 7 Accuracy of VGG13. (Color table online)
Table 8 Accuracy of VGG16. (Color table online)
Table 9 Accuracy of AlexNet. (Color table online)

Up to this point, all the experiments have been performed with a batch size of 256, but other values have also been tested in order to find the best configuration. Tables 10 and 11 show the accuracy using different batch sizes. They show that the global accuracy increases when the batch size decreases.

Table 10 Accuracy using VGG16 and different batch sizes. (Color table online)
Table 11 Accuracy using AlexNet and different batch sizes. (Color table online)

These tables show that relatively good performance can be achieved with some configurations. Notwithstanding that, we observe that, in general terms, the same-room accuracy tends to decrease when the different-room accuracy increases and vice versa. This will be analyzed in depth in future work, but it may be due to the use of the Contrastive Loss function (Sun et al. 2020a).

5.1.3 Influence of the architecture of the feature aggregation layers

As explained in Sect. 4.1, the feature extraction layers output a matrix that is flattened and compressed in the feature aggregation phase. Different combinations of fully connected layers are also evaluated. All these experiments have been performed by training the network with 10% of pairs of images taken from the same room and 90% of pairs of images from different rooms.

Tables 12 and 13 show the results using 3 different combinations of fully connected layers. Each variation is described in Table 3. Similar results are obtained with the 3 variations. The best result is obtained with 3 fully connected layers with 1000, 1000 and 10 neurons respectively. Finally, if we analyse jointly all the results of the room discrimination experiment, the best result is obtained using VGG16 as the feature extraction network, 3 fully connected layers (1000-1000-10), 7 epochs and a batch size of 16; with this configuration a 96.16% global accuracy is obtained: 98.90% same-room accuracy and 93.41% different-room accuracy.

Table 12 Accuracy using VGG16 and different feature aggregation layers. (Color table online)
Table 13 Accuracy using AlexNet and feature aggregation layers. (Color table online)

5.2 Global localization

The global localization is performed as explained in Sect. 4.3.2. The VGG16 network is employed in this task since it led to the best results in the room discrimination task. Different experiments have been performed in order to choose the best configuration. We mainly analyze the ratio of same-room/different-room pairs, which is the parameter that has shown the greatest influence on the results. Moreover, in this subsection we also assess the influence of the data augmentation on the results. Each pair of images is labelled according to Eq. 2.

First, concerning the experiment to evaluate the influence of the ratio of same-room/different-room pairs, we train the network using 8486 pairs of images per epoch from training dataset 1. Second, with respect to the experiment to assess the effect of the data augmentation, 977,856 pairs of images per epoch from training dataset 2 are used. These two experiments are described in Sect. 5.2.1. In both cases, the fully connected layers are configured with 500-500-5 neurons. Moreover, Sect. 5.2.2 evaluates the influence of the feature aggregation layers; in this case, training dataset 1 is used. As common parameters, we use a batch size of 16, the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.001 and a momentum of 0.9, and 30 epochs.

5.2.1 Influence of the training parameters

Ratio of same/different room pairs::

Table 14 shows the results using VGG16 in the feature extraction part and three fully connected layers with 500-500-5 neurons in the feature aggregation part. The training of the model has been performed with different percentages of pairs of images belonging to the same and to different rooms. The results show that the lowest localization error is obtained when the training is performed with 40% of pairs from the same room and 60% of pairs from different rooms. In general, the network shows an excellent overall performance, especially when tested under the same lighting conditions as the training images (cloudy). However, the performance decreases in sunny conditions, which are the most challenging test conditions. As a general rule, training with a large percentage of image pairs from the same room increases the localization error.

Data Augmentation::

Next, we evaluate the influence of the data augmentation on the localization task. Table 15 presents the results using training dataset 2 (augmented) and test datasets 1, 2 and 3. For this purpose, we start from the best configurations obtained so far and show the results according to the percentage of training image pairs. When the training is performed with the augmented dataset, remarkable results in terms of average error are obtained, especially in cloudy and night conditions. In this sense, the mean absolute error decreases by 10 cm in cloudy conditions and by 20 cm in night conditions compared to Table 14 (no data augmentation). However, training with this dataset leads to a decrease in the performance of the network in sunny conditions. Therefore, the data augmentation proves to be beneficial unless the test images present substantial changes with respect to the training conditions.

Table 14 Localization error in terms of mean absolute error (MAE), mean square error (MSE) and average recall (%) at top 1% (Recall@1%) with VGG16. (Color table online)
Table 15 Localization error in terms of mean absolute error (MAE), mean square error (MSE) and average recall (%) at top 1% (Recall@1%) with VGG16 and data augmentation. (Color table online)

5.2.2 Influence of the architecture of the feature aggregation layers

To conclude the experimental section, Table 16 shows the results after evaluating different fully connected layers. Using 4096-4096-1000 neurons in these three layers provides a consistent localization error for cloudy and night conditions; however, its performance degrades in sunny conditions. When the size of the fully connected layers is 1000-1000-10, the best result in cloudy conditions is achieved, but also the worst result for sunny scenarios. In contrast, the configuration with 500-500-5 neurons consistently maintains low errors across all conditions, showing its adaptability to diverse lighting environments and its generalization capability. The Siamese Neural Network is able to perform the localization with an average error of 0.5821 m when using three fully connected layers with 500, 500 and 5 neurons as the feature aggregation method.

Table 16 Localization error in terms of mean absolute error (MAE), mean square error (MSE) and average recall (%) at top 1% (Recall@1%) with VGG16 and different configurations of the fully connected layers when training 30 epochs and 50% of images from the same room and 50% of images from different rooms. (Color table online)

5.2.3 General comparison with other methods

Finally, the Siamese Neural Networks are compared with other previous global-appearance techniques, which include the use of a single AlexNet structure and two classic analytic descriptors, HOG and gist, as described in the work by Cebollada et al. (2022). Table 17 compares all the methods in a global localization task using, in all cases, the COLD-Freiburg dataset. This table shows that the Siamese structures with the VGG architecture and the data augmentation proposed in the present work provide the best results in terms of localization error for cloudy and night conditions. Also, the approach proposed by Rostkowska and Skrzypczyński (2023) achieves good results in the case of sunny conditions. Apart from using a different architecture, the main difference between their approach and the one presented here is that they use a cross-entropy loss (single input) during training, while in the present paper we employ the contrastive loss (double input). Furthermore, in the present paper the model is fed with an omnidirectional image transformed to a panoramic view, whereas Rostkowska and Skrzypczyński (2023) directly use the omnidirectional image without conversion. In addition, they embed the image with an EfficientNet model (Tan and Le 2019), which is followed by the Facebook AI Similarity Search (FAISS) KD-Tree, while in the approach proposed in the present paper the pairwise Euclidean distance between descriptors is computed and used to retrieve the closest descriptor in the database.

Table 17 Comparison with other methods. (Color table online)

6 Conclusions

In this paper, a global localization method using Siamese Neural Networks has been proposed and evaluated. Localization, along with mapping, is one of the main tasks to be addressed by an autonomous mobile robot. First, an initial task of discriminating between same-room and different-room image pairs has been proposed in order to assess the ability of Siamese Neural Networks and to study the influence of the most relevant parameters. After that, the global localization problem is addressed.

In the experiments, several well-known architectures have been tested to form the Siamese Neural Network, some of which are AlexNet, VGG11, VGG13, VGG16, VGG19, VGG11bn, VGG13bn, VGG16bn and VGG19bn. The best performance in the initial task has been achieved by VGG13 and VGG16. In general terms, the VGG architectures have provided the best results.

Apart from these feature extraction architectures, a group of fully connected layers has been added to convert the activation maps resulting from the convolutional layers into a description vector. In the present work, different sizes of the fully connected layers have been studied, as well as the size of the final descriptor. For the initial task, the performance of the network is slightly higher when the fully connected layer sizes are 1000-1000-10. In contrast, in the global localization task, the localization error decreases drastically in those networks that have a set of fully connected layers of size 500-500-5 neurons.

The training parameter that contributes most to the performance of the network is the percentage of image pairs belonging to the same and to different rooms. In this sense, there is a correlation between the percentage of pairs of each category and its respective accuracy, i.e., when the percentage of pairs of images from the same room increases, its associated accuracy also does, and a similar effect occurs with the different-room category. Furthermore, when the same-room accuracy increases, the different-room accuracy decreases, and vice versa. This situation may be caused by the Contrastive Loss function, which has an associated lack of flexibility in the optimization. Other loss functions used in other applications, such as Circle Loss (Sun et al. 2020b), could improve the localization results and will be considered in future studies.

In addition, a data augmentation technique has been proposed in order to improve the performance of the network. The proposed effects try to simulate real operating conditions, and a set of effects specially designed to increase the robustness against changes of the lighting conditions in the scene has been generated. Regarding the results, the performance of the network benefits especially when working in cloudy and night lighting conditions. In the case of the cloudy lighting condition, when the training is performed with data augmentation, the average localization error is reduced by around 12 cm. As for the night illumination condition, the average error is reduced by around 20 cm. On the contrary, in the sunny illumination condition the average localization error increases by 34 cm when data augmentation is used. Thus, the Siamese architecture is very efficient at solving the localization problem in real operating conditions if the changes in the lighting conditions are not considerable, i.e., when working in cloudy and night scenarios. However, it is less effective at describing images in the presence of significant changes in lighting conditions, such as in the sunny scenarios. Other methods (such as HOG or gist) describe the image globally and give equal importance to all its regions, thus providing better resilience to large illumination changes. The reduced performance in sunny conditions when using Siamese architectures can be explained by the lack of flexibility associated with having two networks with identical weights. In addition, the training process may have introduced an imbalance that makes the network more capable of detecting similarities than dissimilarities, or vice versa. Additionally, training dataset 1 (without data augmentation) comprises images from all illumination conditions, whereas training dataset 2 (with data augmentation) is limited to cloudy images and attempts to replicate other illumination conditions by applying global and local effects. In this context, the proposed effects for data augmentation are beneficial in cloudy and night conditions, thus enhancing the performance of the model in these scenarios. However, the illumination effects that simulate different sunny conditions have proven to be less effective than using real images captured under this particular illumination condition.

As future work, the proposed localization techniques will be extended to outdoor environments, which are more challenging because of their unstructured and changing conditions. In addition, other types of sensors, such as LiDAR, will be considered to carry out the localization robustly.