1 Introduction

Evolutionary robotics (ER) is the study of using evolutionary algorithms (EAs) to create robot controllers and morphologies iteratively. Most research focuses on using ER to automatically create controllers which fulfil high-level tasks [1]. Traditional machine learning applications in robotics typically involve finding the parameters of an existing mathematical model using gradient-based optimisation techniques [2] or stochastic biologically inspired algorithms [3]. ER differs in that the entire control policy of the robot is evolved with minimal human intervention, thus yielding novel and often unexpected strategies to perform the goal task [4].

The complexity associated with visual information has limited the use of camera-equipped robots in the field of ER. The first problem is that even a low-resolution camera image will contain many pixel values. Providing these values as inputs to a controller significantly increases the complexity of the controller and the search space that the EA would have to explore to find a suitable solution. The second problem arises because most ER controllers are evolved in simulation. Building a simulated visual environment to facilitate controller evolution is a tedious and time-consuming task, requiring the manual creation of three-dimensional models, textures and lighting. A high-quality, photo-realistic environment is inevitably computationally expensive to simulate, significantly decreasing the rate at which iterations of the ER process can be performed. Furthermore, controllers will not transfer successfully to the real-world robot if the visual difference between the simulated and real-world environments is too significant. This problem is referred to as the reality gap.

This paper proposes the neuro-augmented vision for evolutionary robotics (NAVER) approach, which employs deep neural networks (NNs) to reduce and simplify the inputs provided to the controller and to make complex, high-quality visual simulators unnecessary.

The main contributions of this paper are: a theoretical description of the NAVER approach, which solves the challenges of the large number of inputs associated with image data, high computational requirements of simulated visual environments and the visual reality gap; a detailed description of how the various components of NAVER are created; and an empirical evaluation of NAVER on a difficult benchmark task.

The paper is structured as follows: Sect. 2 provides the background information needed for the related work, which is discussed in Sect. 3 and the description of the NAVER approach in Sect. 4. The experimental environment which was used to test and validate NAVER is discussed in Sect. 5. The implementation details are presented in Sects. 6 and 7, and results are discussed in Sect. 8. Conclusions are drawn in Sect. 10.

2 Background

2.1 Evolutionary algorithms

Evolutionary algorithms (EAs) are iterative optimisation algorithms that mimic biological evolution, gradually improving candidate solutions over many generations. A population of candidates is evolved towards a particular goal by performing crossover and mutation on candidates in the current population to form the next generation. Requiring only the ability to compare the relative performance of candidate solutions, EAs can generate good solutions to a given task [5].

2.2 Evolutionary robotics

A robot controller is the software which governs how the robot behaves. If the robot has sensors, such as light sensors, touch sensors, and cameras, the controller will receive their signals as inputs. The robot controller decides what action the robot will take. In the case of a wheeled robot, the controller will send commands to the motors attached to the wheels. On a legged robot, the controller will instruct the motors connected to the legs to move [6].

Robot controllers can be challenging to design. A hexapod robot can contain three motors for each of its six legs. Accordingly, a programmer must program commands for up to 18 motors to complete a walking motion [7]. Alternatively, consider a robot with a complex sensor, such as a camera. The program would need to identify walls and other obstacles and determine their distances.

ER is the process of evolving robotic controllers with EAs, eliminating the need for manually programmed controllers [1]. Neural networks (NNs) are often employed in ER as robot controllers, as their behaviour can be adapted by modifying the network weights, i.e. candidate solutions in an EA population each consist of the weights of a NN controller [6, 8,9,10,11,12,13]. The controller’s performance is measured with a fitness function. The process of training the NN to produce the desired behaviour on the robot can be viewed as an optimisation problem, where the EA must optimise the network’s weights to maximise the controller’s fitness.

ER inherits all the advantages of EAs. The programmer’s knowledge does not limit the robot’s movements and behaviours, and ER has the potential to generate robotic controllers which outperform manually programmed ones.

While using ER provides many advantages, evolving a suitable controller requires executing many candidate controllers on the robot to evaluate their fitness. Executing many controllers on a robot can take a prohibitively long time and can degrade and damage the hardware [9, 14, 15]. Consequently, controller fitness evaluation is typically performed in simulation on virtual robots.

Simulators have become a necessity in ER. Controllers can be evaluated on virtual robots in simulation, allowing the evaluation of thousands of controllers in minutes. Simulators allow larger populations to be trained for more generations, improving the performance of the evolved controllers. In addition, simulators streamline the entire evolutionary process, enabling the programmer to test many different fitness functions and different combinations of hyperparameters, such as the controller network size and evolutionary mutation rate, further improving performance [16].

2.3 Simulation in evolutionary robotics

Simulators are designed to emulate a robot’s sensors for inputs to the controller and translate the controller’s outputs to changes on the virtual robot. Consider a differentially steered robot with two wheels and eight short-range infrared sensors around its frame. A simulator would need to have a model of the physical environment in which the robot operates. In the case of a closed-loop controller, the simulator would check for anything in the range of the infrared sensors and send the appropriate signal to the controller being evaluated. If the robot has an onboard camera, then the simulator must render a visual environment for the camera’s video feed. The controller would issue motor commands to the wheels, and the simulator would move the simulated robot in the virtual environment. This cycle would repeat, and the relevant information about the simulated robot’s behaviour would be provided to the fitness function to score the controller [6].

Simulators are not a perfect model of reality. There will always be some difference between the simulated and real environments. This difference is known as the reality gap. Simulators can facilitate controller evolution provided that the reality gap remains small. If the reality gap grows too large, the EA will optimise the controllers to exploit peculiarities of the simulated environment, hindering their performance in the real environment [17]. The reality gap is especially challenging when an onboard camera must be simulated, as the environment needs to be recreated in simulation, requiring detailed three-dimensional models, textures and lighting to match the real environment. Creating a visual simulator can be time-consuming, and rendering camera-realistic scenes is processor intensive, which slows down the simulator.

Adding noise to the simulators can mitigate the reality gap effect, forcing the controllers to operate in a more variable environment [8]. For example, if the simulator tended to overreact to motor commands in the forward direction, that would mean that the virtual robot would arrive at its destination before the physical robot when following the same commands. If noise was added to the simulator, sometimes the virtual robot would arrive earlier and other times later. The controller is thus forced to be more flexible. However, too much noise can make it difficult for the EA to converge on a suitable controller.

NN-based physics simulators have been shown to provide competitive results to traditional physics simulators in ER, even producing superior results in some cases [9]. These simulators operate by training a NN to receive the commands from the controller and produce the robot’s expected movement. An advantage to NN-based simulators is their simplicity of creation. Random commands are sent to the physical robot, and its behaviour is recorded. A NN is then trained from this collected data. They require no domain-specific knowledge about physics simulation [11].

2.4 Autoencoders

An autoencoder refers to a pair of NNs, an encoder and a decoder. The encoder network E receives an input vector \(\vec {x}\) and converts it to an encoded vector \(\vec {h}\), such that \(E(\vec {x})=\vec {h}\). The decoder D receives an encoded vector \(\vec {h}\) and converts it to an output vector \(\vec {y}\). During training, the encoder and decoder are combined as \(D(E(\vec {x})) = \vec {y}\) and trained as a single network. The network is optimised to make the output vector \(\vec {y}\) match the input vector \(\vec {x}\).

After training, the encoder and decoder are split, and the input vector \(\vec {x}\) can be encoded to and decoded from an encoded vector \(\vec {h}\). Since the encoding from \(\vec {x}\) to \(\vec {h}\) is determined automatically by the optimisation algorithm, autoencoders are a form of unsupervised learning. During training, the desired output vector is the same as the input vector, meaning they can be trained on unlabelled datasets.

In the case of an undercomplete autoencoder, \(\vec {h}\) will have fewer dimensions than \(\vec {x}\). This results in an encoder network that can compress information from a high-dimensional space to a low-dimensional one, and a decoder network that can restore that low-dimensional information to a high-dimensional space [18]. Many approaches to using NNs to compress images can be found in the literature [19], but autoencoders differ in that they not only compress images, but can also reconstruct images from the compressed representations.

Two types of deep NN are used to build an autoencoder for images: a convolutional neural network (CNN) and a transpose convolutional neural network (TCNN). A CNN uses convolutional layers with learned filters to reduce the dimensions of its input after each layer, while a TCNN does the opposite. CNNs are often used for image classification tasks; images are given as input, and an output layer identifies the class of each image. A TCNN receives a vector as an input and produces an image as an output using transpose convolution operations. TCNNs are sometimes referred to as deconvolutional NNs, but that name will not be used in this study to avoid confusion, as a deconvolution operation is different from a transpose convolution [18].

A CNN and TCNN can be connected to form an undercomplete autoencoder for images—a convolutional autoencoder (refer to Fig. 1). The CNN acts as the encoder, receiving images and producing an encoded vector. The TCNN is the decoder, converting the encoded vectors back into images. A convolutional autoencoder allows for high-dimensional images to be compressed into low-dimensional vectors.

Fig. 1
figure 1

A convolutional autoencoder. Most convolutional and transposed convolutional layers have a stride of 2, halving or doubling the width and height of the preceding layer, respectively. The encoder is a traditional CNN
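To make the structure concrete, the following is a minimal sketch of a convolutional autoencoder in PyTorch. The layer counts, channel sizes and filter sizes are illustrative assumptions rather than the architecture used in this study (which is described in Sect. 6.3); the input size of 64 × 48 (width × height) pixels matches the camera images used later.

```python
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Minimal undercomplete convolutional autoencoder (illustrative sketch).

    Assumes 64x48 (width x height) colour images, i.e. tensors of shape
    (batch, 3, 48, 64); channel counts and the latent size are arbitrary.
    """
    def __init__(self, latent_dim=8):
        super().__init__()
        # Encoder: a CNN whose stride-2 convolutions halve the spatial size.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2),   # 48x64 -> 24x32
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2, padding=2),  # 24x32 -> 12x16
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 12 * 16, latent_dim),                    # encoded vector h
        )
        # Decoder: a TCNN whose stride-2 transpose convolutions double the size.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 32 * 12 * 16),
            nn.Unflatten(1, (32, 12, 16)),
            nn.ConvTranspose2d(32, 16, kernel_size=4, stride=2, padding=1),  # 12x16 -> 24x32
            nn.ReLU(),
            nn.ConvTranspose2d(16, 3, kernel_size=4, stride=2, padding=1),   # 24x32 -> 48x64
            nn.Tanh(),  # outputs in [-1, 1], matching normalised input images
        )

    def forward(self, x):
        h = self.encoder(x)   # compress the image to a low-dimensional vector
        y = self.decoder(h)   # reconstruct the image from the encoded vector
        return y, h
```

Training such a network amounts to minimising the difference between the input image and the reconstructed output, as discussed below.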

Convolutional autoencoders are helpful because they allow a dataset of images to be compressed into encoded vectors. They function well when the dataset contains related images, such as only satellite images or only dashcam images, as that allows them to learn and create encodings for frequently occurring features.

The optimisation algorithm is often set to reduce the difference between the input image and the reconstructed output image. In that case, the mean square error (MSE) loss measures the autoencoder’s reconstruction error [20].

A convolutional autoencoder will opt to move different images far apart from one another in the encoded space [21, 22]. While such an encoding distribution is an effective way to compress a dataset, the encoded vector’s dimensions are not very meaningful, i.e. related images may be encoded far apart in the encoded space. A variational autoencoder (VAE) [23] adds an additional loss component during training to encourage the encoded vector distribution to form a normal distribution. VAEs are an extension of regular autoencoders, and in the case of image VAEs, the additional loss component keeps images from straying too far apart in the encoded vector space [21, 22]. With VAEs, encoded vectors are often referred to as latent vectors and moving around in the latent space translates to meaningful, gradual changes in the reconstructed image.

The additional loss component in VAEs needs to measure the difference between the latent vectors’ distribution and a standard normal distribution, which is achieved using the Kullback–Leibler divergence (KLD). Traditional convolutional autoencoders only optimise the MSE loss of the reconstructed image. In contrast, VAEs optimise both the MSE reconstruction loss and the KLD loss, where the KLD loss represents the meaningfulness of the encodings [21, 22].

In a VAE, the KLD loss encourages the encoder to produce latent vectors where each vector element’s values form a standard normal distribution. However, to calculate the KLD loss, the standard deviation and mean of the latent vector distribution are needed. Instead of summing up all the values of each latent vector element over the entire training dataset, VAEs change how the encoder produces latent vectors. In a typical autoencoder, each output neuron of the encoder would produce a single latent vector element. In a VAE, however, two output neurons are required to produce a single latent vector element. Instead of producing the vector element directly, those two output neurons define a normal distribution’s standard deviation and mean. This allows two output neurons to control the shape and position of a normal distribution. The latent vector element is produced by sampling a value from that distribution, where sampling a value refers to generating a random number, and the probability distribution affects what the number is likely to be. The encoder can influence each value in the latent vector, but it cannot choose them precisely as there is a small element of randomness in each latent vector sampled. Generating the latent vectors like this makes it easy to calculate the KLD loss. The KLD loss measures the difference between distributions, and in the case of normal distributions, it can be calculated using the mean and standard deviation of each distribution.
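A short sketch of this sampling step and the KLD loss is given below. It uses the standard closed-form KL divergence between a diagonal Gaussian and the standard normal; parameterising the second output neuron as the log-variance rather than the standard deviation is a common implementation choice and an assumption made here for numerical stability.

```python
import torch

def sample_latent(mu, log_var):
    """Sample each latent element from N(mu, sigma^2) using the two values
    produced by the encoder's output neurons (reparameterisation trick)."""
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)      # the small element of randomness
    return mu + eps * std

def kld_loss(mu, log_var):
    """Closed-form KL divergence between N(mu, sigma^2) and N(0, 1),
    summed over latent elements and averaged over the batch."""
    return -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1).mean()

def vae_loss(x, x_reconstructed, mu, log_var, beta=1.0):
    """Total VAE loss: MSE reconstruction error plus the weighted KLD term."""
    mse = torch.mean((x - x_reconstructed) ** 2)
    return mse + beta * kld_loss(mu, log_var)
```

The weighting beta between the two losses is a hyperparameter; the balance used in this study follows the scheme described in Sect. 6.3.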

3 Related work

The ability to interpret visual information is a core characteristic of an intelligent robot. Target tracking, object recognition and autonomous navigation are common tasks for robot vision systems. Most robot vision systems fall into one of three categories: one-dimensional laser range finders, which produce a one-dimensional array of depth values; standard cameras, which produce two-dimensional images; or 3D depth cameras, which provide depth information for each pixel [24]. This study focused specifically on robots equipped with regular cameras. The following sections report on previous robot vision experiments in the field of evolutionary robotics as well as the various challenges in this field. Related work from reinforcement learning (RL) is presented in Sect. 3.2. A previous approach by the current authors, which addresses some of the challenges of vision in ER using techniques inspired by work in RL, is briefly discussed in Sect. 3.3.

3.1 Active vision and feature selection in ER

Cameras produce high-dimensional information, meaning many individual numbers are required to represent a single video frame. In the case of a \(64 \times 48\) colour video feed containing \(64\times 48=3072\) pixels, each pixel requires three numbers to represent its colour, and hence each frame requires \(64\times 48\times 3=9216\) numbers. Typically, ER trains NN controllers, and attaching a high-dimensional input to a NN means that the network will have many input neurons. As the number of input neurons increases, so does the number of weights in the network. This results in an increased number of parameters for the EA to optimise, expanding the search space and increasing the difficulty of the optimisation task [25].

The majority of ER controllers described in the literature process image data with two common strategies—active vision [26,27,28] and feature selection [26, 27]. Both avoid using all the pixels of a high-resolution image as inputs to the NN, thereby allowing the network to process high-dimensional image data without dramatically increasing the EA’s search space.

Active vision is inspired by biological vision. Instead of processing the entire image at once, active vision sequentially focuses on different parts of a visual scene. The pixel data inside the focus region is provided to the controller, and the controller can adjust what parts of the image to focus on. Thus, the controller NN uses a small number of weights to receive the input image [26].

Feature selection processes the entire image at once. Features, such as edges, corners or shapes, are identified within the image [26]. CNNs are a feature selection approach, using filters to create feature maps from the input image. The filters are sets of shared weights, meaning a CNN has far fewer network parameters than fully connected NNs.

3.1.1 Evolution on a physical robot

Harvey et al. [29] pioneered evolving robot controllers for camera-equipped robots. Due to computational limits at the time, simulation was not feasible, so the entire evolutionary process was carried out in the real world. A special robot was developed—the Gantry Robot. It consisted of a \(64\times 64\) monochrome camera mounted on two axes. The axes allowed the camera to be positioned anywhere inside an arena. Touch sensors around the camera prevented it from making contact with the wall, and a rotating mirror attached to the camera allowed it to face in any direction.

Controllers were evolved on the Gantry Robot to locate a sheet of paper mounted on one of the walls and to follow a moving white cylinder. A population of 30 controllers was trained for 15 generations over a day. It took 3 min to evaluate each controller on the robot, which added up to 1.5 h per generation. Controllers were successfully evolved. However, despite the relatively simple task, the solutions were found to be brittle and sensitive to the lighting in the environment.

Floreano et al. [26] used coevolution to create a combined active vision and feature selection approach to control a differentially steered robot. The task was to maximise the distance covered by the robot in an arena for 60 s without colliding with the walls. A total of 25 visual inputs were provided to the controller. These values were extracted from a \(240\times 240\) image by the evolved vision system. A population of 40 controllers was trained for 15 generations, and evolution took 1.5 h per generation. The evolved controller successfully moved the robot around the arena while avoiding the walls.

3.1.2 Evolution in simulation

Controllers for camera-equipped robots have been successfully evolved in simulation and transferred to the real world [16, 27, 30, 31]. However, they all suffer from the same drawback—creating a realistic virtual visual environment by hand is a challenging, time-consuming and often infeasible task. A visual environment requires photo-realistic three-dimensional models, textures and lighting. No simulation is perfect, and if the difference between the simulated environment and the actual environment is too significant, i.e. the reality gap is too broad, then controllers evolved in the simulated environment will not function correctly on the physical robot [1].

Floreano et al. [28] used an active vision controller for an outdoor mobile robot. The controller NN received a \(5 \times 5\) monochrome grid of pixels as an input. These pixels were a scaled-down representation of the much larger camera image. Each of the 25 pixels could either represent the average of a square of pixels in the camera image or sample a single pixel from the camera image, and the genetic algorithm could select whether pixel averaging or pixel sampling should be used. Controllers were first evolved in simulation and then transferred to the real world. The experimental results showed that controllers tended not to transfer well to the real world due to the reality gap.

Different approaches have been used to mitigate the reality gap, but each has its disadvantages. One method is to use an extremely low video resolution, such as 50\(\times \)1 monochrome [16], 5\(\times \)1 monochrome [31], and 5\(\times \)5 [27]. A lower resolution limits the controller’s perception of detail, making differences between the simulator and the real environment less apparent. However, this also severely limits the controller’s perception of the environment.

An alternative method is to use an image processing unit which receives images and provides specific information about them to the controller. This unit can take the form of an image processing algorithm, such as with Hornby et al. [30], where the images were processed and information about the objects in the image was provided to the controller, specifically their positions, angles and colours. This has the drawback that the controller is provided with less information than if it received the images themselves.

Alternatively, these image processing units could be a trained CNN. Narayan [27] evolved an active vision controller with the task of following roads and paths. The controller could vary how it perceived colours, allowing it to select a colour transformation which showed the greatest contrast between the road and the surrounding environment. A CNN was trained to identify the positions and angles of roads using a set of road datasets. The simulator only needed to provide the controller with the positions and angles of the road, not render actual road images. While this method was effective for roads, it can only be used for environments with labelled datasets.

3.2 Vision for reinforcement learning agents

The field of reinforcement learning (RL) trains software agents by rewarding desirable actions [18]. RL experiments have trained agents to complete tasks requiring visual information, such as playing a video game, and these RL agents can take the form of NNs trained with an EA [20]. The agents receive video input of the game screen (\(64 \times 64\) pixels, a relatively high resolution compared to the inputs typically used in ER) and send commands to play the game [20, 25, 32]. Closed-loop vision-based controllers evolved with ER share many properties with these RL agents. They both use NNs evolved by EAs and process images with NNs. Hence, there are techniques used in vision-based RL experiments which could be helpful in developing approaches for vision in ER.

Designing an RL agent to receive images, even with a CNN, can significantly increase the optimisation algorithm’s search space. For this reason, the CNN component is often trained separately before the RL process begins. In Narayan’s ER experiment [27], a CNN was trained beforehand with a labelled dataset, and the EA was then only tasked with evolving the controller, keeping the search space small.

Unfortunately, for many RL and ER tasks no labelled dataset exists. Unlabelled datasets, however, are much easier to obtain. In the case of an agent playing a video game, an unlabelled dataset is simply a collection of images from the game’s video feed.

A CNN can be trained in the form of a frame compressor using an unlabelled dataset [20, 25, 32]. Ha and Schmidhuber [20] trained a VAE as a frame compressor. The experiment tasked an agent to play a video game. Using an autoencoder meant that it could be trained with an unlabelled dataset, which simply required collecting a few thousand images from the game.

Selecting a VAE over a traditional convolutional autoencoder meant that the latent vectors produced by the frame compressor had more meaningful dimensions. For instance, in a video game environment where the player can only move left or right, a single latent vector element represents the player’s position. When the latent vector element was low, the player was near the left wall; when it was high, the player was near the right wall. This meaningfulness in the latent vectors made it easier for the agent NN to interpret the game’s video frames.

The experiment used a recurrent neural network (RNN) to predict the next video frame as a latent vector, effectively simulating the game. This RNN provided the agent NN with an internal model of the game world. The evolved agent played the game effectively.

3.3 Neural visual simulation

The current authors previously proposed an approach, named neural visual simulation (NVS), that used a NN vision simulator in the ER process [33]. The simulator was made possible by first training a VAE on 3400 camera images gathered within the experimental environment. The encoder part of the VAE was then used as a frame compressor that converted \(48 \times 64\) colour images into a latent vector of length 8. The trained frame compressor was then used to train a NN vision simulator which took the location and orientation of the robot as input and produced the latent vector representation of what the robot would see from that position in the environment. The NN vision simulator was used with a motion simulator to evolve robot controllers to seek out target objects within an arena while avoiding the walls and other objects. During the simulation loop, the motion simulator would predict the robot’s position after each command, after which the NN vision simulator would produce a latent vector which was used as input to the robot controller. The robots learnt to interpret the latent vector inputs correctly and successfully solved the task. The evolved controllers transferred well to the real world, where the frame compressor was used to convert actual camera images into latent vectors for inputs.

NVS has several advantages over previous robot vision attempts in ER. Firstly, thanks to the frame compressor, the controllers are provided with a realistic representation of the environment (not a low-resolution partial image as in the active vision and feature selection approaches) while requiring very few inputs, thus not greatly increasing the search space. Secondly, the visual simulator could be created without the need to manually create a photo-realistic 3D environment. The simulator was trained by directly collecting images from a large number of locations in the environment. Thirdly, the visual simulator is fast, because it is a relatively small NN which simply maps location and orientation values to a latent vector of length 8, thus saving time during the evolution process where many candidate controllers must be evaluated.

The NVS approach does, however, have limitations. The first disadvantage is that NVS performs better in open environments than in closed environments, such as environments with many internal walls. The NN vision simulator learns the relationship between the position of the robot and the properties of the objects in its field of view. In a closed environment, many objects will be occluded by the walls which divide the environment. This effectively reduces the number of training images available to the NN vision simulator for each object, as images recorded from one room do not help train the NN vision simulator to render another room. Accordingly, many images are required to train the NN vision simulator for each room, making NVS infeasible for closed environments such as mazes.

The second disadvantage is that NVS can only recreate an existing physical environment. Due to the automated nature of the approach, individual objects in the environment can only be changed by creating a whole new environment. For training to succeed in the demonstration experiment, the controllers needed to perceive the objects in six different locations (so that objects could be placed at arbitrary positions when controllers were transferred to the real world). Data had to be collected from six different visual environments (in each case, objects were placed in different positions), and a NN visual simulator trained for each environment.

4 Neuro-augmented vision for evolutionary robotics

This paper proposes neuro-augmented vision for evolutionary robotics (NAVER). As with the NVS approach, NAVER uses a frame compressor to reduce the number of vision inputs provided to the controller. However, to render a virtual visual environment, NAVER uses a traditional visual simulator (discussed in Sect. 4.1) instead of a latent vector simulator. Much tedious work is typically required to create a photo-realistic visual environment in a traditional visual simulator, as three-dimensional models, textures and lighting are required. Instead, NAVER uses a second frame compressor to strip texture and lighting information from the image produced by the robot’s onboard camera (refer to Sect. 4.2). The traditional visual simulator can be simple and easy to create, requiring only three-dimensional models in which textures and lighting are ignored. This offloads the bulk of the tedious work to NNs, while providing the researcher with a high level of control over the environment. Section 4.3 describes how the first frame compressor is used in the simulation loop during the evolution process and how the second frame compressor is used when the controller is transferred to the real world.

4.1 Traditional visual simulator

With traditional visual simulators, every object inside the environment can be individually changed. No training images are required for the environment because the virtual visual environment is constructed by hand, with the modelling, textures and lighting done manually.

With NAVER, instead of manually creating a photo-realistic visual environment in the traditional simulator, it is only necessary to build the three-dimensional models. In place of realistic texturing and lighting, the models are covered in bright colours, where each different colour acts as a marker for a particular texture. This simplified environment is significantly more straightforward to create than the environment typically necessary for a traditional visual simulator while being just as easy to modify.

With NAVER, the traditional visual simulator receives a robot’s location and produces a simplified image of what the robot expects to see at that location. Instead of textures, the simulator paints each object in the scene with a solid colour—such as grey for the ground and red and white for the walls (Fig. 2). These colours allow the controller to distinguish between different objects and enable the frame compressors to link objects in the real environment to the simplified simulated environment.

Fig. 2
figure 2

Overview of a maze in a simple traditional simulator

4.2 Two frame compressors

NAVER uses two NN frame compressors. The first frame compressor receives images from the simple traditional simulator and produces latent vectors to represent them. The second frame compressor receives real-world images from the robot’s camera and produces latent vectors in the same latent space. A simulated image and a corresponding real-world image are presented to the first and second frame compressors, respectively, and both are trained to produce the same latent vector (Fig. 3).

Fig. 3
figure 3

The simple traditional simulator rendered the top image, and the bottom image was captured in the real world. Two frame compressors encode both images to the same latent vector

A robot moves to different positions and orientations in an environment and records images from an onboard camera to create a training dataset. The positional information is provided to the traditional visual simulator, which produces corresponding simplified simulated images.

Training the frame compressors requires repeating two steps. For the first step, a normal VAE is trained for a single iteration with a single minibatch of simulated images. Error is measured in the usual fashion for VAEs, as the difference between the reconstructed images and the input images, as well as the meaningfulness of the encodings. For the second step, the second encoder receives a minibatch of corresponding real images, and it is trained to match the latent vector output produced by the first encoder. These two steps are repeated until both encoders are sufficiently trained (Fig. 4).

Fig. 4
figure 4

Two steps repeat, training the two frame compressors simultaneously. Error travels backwards through the shaded arrows, only training the parameters in the shaded NNs

4.3 Simulation loop and real-world execution

For controller evolution, images from the simplified simulator are encoded by the first frame compressor and presented as latent vectors to the controller. The simulation loop is shown in Fig. 5.

Fig. 5
figure 5

Simulation loop for the NAVER approach. Other than the traditional visual simulator, each block represents a NN

During real-world evaluation, images from the onboard camera are encoded by the second frame compressor and presented as latent vectors to the controller, as illustrated in Fig. 6. Since the controller only receives latent vectors, the simple simulated and real-world environments will appear the same. This makes creating a visual simulator significantly easier and less time-consuming, as minimising the visual reality gap is offloaded onto the frame compressors.

Fig. 6
figure 6

NN configuration for running the controller on the physical robot after training is complete

4.4 Implementation steps

Figure 7 shows the steps followed to implement NAVER. First, the traditional visual simulator is created using only simplified three-dimensional models which mimic a real environment. Second, images are collected from the real environment with the physical robot. At the same time, corresponding images are collected from the simulated environment. Third, both frame compressors are trained. Fourth, a controller is evolved in a simulated environment created by the traditional visual simulator in conjunction with one of the frame compressors and a motion simulator. Last, the best controller is transferred to the physical robot for testing.

Fig. 7
figure 7

NAVER implementation steps

4.5 Discussion

The NAVER approach has several advantages. Firstly, the visual inputs are compressed to a low-dimensional vector which do not require a large number of input weights to be added to the controller. Secondly, NAVER only requires a very rudimentary traditional three-dimensional simulator, which means that it is easy to create and, perhaps more importantly, fast to execute which makes the simulation process computationally inexpensive. Thirdly, the data needed to train the frame compressors need not cover the entire environment, it merely needs to be a representative sub-area of the environment. Fourthly, the simple traditional simulator makes it easy for the experimenter to create a wide range of training scenarios to be used in the evolution process.

5 Experimental environment

The experimental work aims to demonstrate that NAVER can be used to create vision-based controllers. A controller was evolved and executed on a physical robot with an onboard camera in an environment where the robot had to complete vision-based tasks. The double T-maze, a benchmark often used in ER and other fields [34,35,36], was selected as a closed-off testing environment. The maze is constructed from long narrow corridors connected at three T-junctions (Fig. 8). Controllers are evolved to move towards a particular corner in the maze from the starting position. When travelling from the starting position, the controller will encounter two T-junctions. At each T-junction, the controller can turn left or right, and the route selected will determine which of the four corners the robot will reach. Figure 9 shows the real-world double T-maze for this experiment.

Fig. 8
figure 8

The double T-maze. The robot will begin at the start position and end at one of the four corners

Fig. 9
figure 9

Experimental environment

In a double T-maze, the robot controller is expected to be able to navigate to a corner specified by a signal. In previous research, the desired corner was specified using four input neurons, where each neuron was associated with a unique corner. Many infrared distance sensors were mounted around the robot’s chassis, providing the controller with information about its proximity to the walls [35].

For the task investigated in this study, the robot’s only sensor is the onboard camera. The controller must interpret the camera feed to turn in the correct direction at T-junctions. Instead of input neurons, visual signs signal to the controller which corner it needs to reach. A coloured piece of paper is placed on the ground at each T-junction as a sign. The sign will be one of two colours—blue to indicate that the robot should turn left and orange to turn right. The task exhibits some of the complexities of the field of self-driving cars, where the AI driver must recognise and respond to objects encountered on the road [37].

The robot is tasked with navigating to a pre-selected corner of the maze by following the directions indicated by the signs at each T-junction. The signs will alternate between executions, ensuring that a particular controller is capable of reaching any corner. The task is exceptionally difficult for a vision-based controller. The controller is not provided with any other sensor information about its position in the maze, and it has to use the camera information alone to determine its path and its proximity to the walls. The robot’s chassis is relatively large compared to the size of the passages, leaving little room for error. The camera has a narrow field of vision, meaning that the robot has to remember what it has seen before, as very little of the environment is visible in any single camera image. The signs to indicate which direction the robot must turn are visible well before the robot reaches each T-junction, and due to the camera’s limited field of view, the signs are often not visible once the robot has entered a T-junction. The controller has to be aware of its environment to successfully turn at a T-junction, as the mere presence of a sign does not mean that the robot can immediately turn. To the authors’ knowledge, this is the most ambitious vision-based task attempted on a camera-equipped robot that uses a controller evolved through ER.

6 Frame compressors

The NAVER approach uses two frame compressors to bridge the reality gap between a simple visual simulator and the real-world environment. First, a simplified traditional visual simulator was prepared, then images were recorded from the robot’s onboard camera, and the frame compressors were trained.

6.1 Traditional visual simulator

The purpose of the traditional visual simulator was to render a simplified three-dimensional model of the world given a particular robot position. Different colours on the model represented different objects in the environment. Certain colours selected for the simulator happened to be similar to the textures in the real environment, such as red and white walls. In contrast, others were different, such as the ground being grey. Whether or not the colours matched up with the real environment was inconsequential, as different frame compressors were used for the simulated and real environments, meaning none of their filters were shared. As shown in Fig. 2, green and cyan squares on the ground were used as signs to direct the robot to turn left or right. Using similar colours for the signs encouraged the frame compressor to give them similar encodings, making it easier for the controller to differentiate signs from the surrounding environment.

Four real-world images and their corresponding simulator images are displayed in Fig. 10. There was a slight difference between the angles and positions of the walls in the real-world images and simulated images. However, the difference was not significant enough to hinder the frame compressors.

Fig. 10
figure 10

Images captured in the maze with their simulated counterparts

Consider the training technique for the two frame compressors shown in Fig. 4. The first encoder dictates the meaning of each latent vector element and only receives simulated images. The second encoder tries to produce the same latent vector created by the first encoder, meaning that it will ignore any details which are not included in the simulated image. Thus, the detail in the simple visual simulator is the only information the controller will receive. Unimportant features such as imperfect jagged corners need not be modelled as the second encoder can filter them out.

Since the controller only needs to know the walls’ location, they could be represented with one colour instead of two. However, for this experiment, two colours were kept as the wall can envelop the entire camera frame in some positions, and having a second colour allows the robot to orient itself in such situations more easily.

The traditional simulator was created using a lightweight video game library for convenience. The model was created using two-dimensional planes and a collection of points that roughly matched the real-world maze proportions.

6.2 Data collection

A single passage of the maze was used for data collection. It provided examples for the walls and ground, and both coloured signs were placed at the end. Table 1 shows the mapping between the traditional simulator’s colours and their associated real-world textures.

Table 1 Mapping between colours produced by simulator and real-world textures

The robot was set to move to 460 grid points in the passage and capture 12 camera images at each point, creating just over 5500 images. The signs were then swapped, and the process was repeated, creating a dataset of just over 11,000 image pairs. An overhead webcam tracked the robot’s position, and the position information was used to create an associated image from the traditional visual simulator for each image captured in the real world. Data collection was completed in a few hours.

6.3 Frame compressor training

After the VAE was trained, the encoder network fulfilled the role of a frame compressor. For this experiment, two CNNs acted as the encoders and a TCNN served as the decoder. The encoders were trained in parallel, where the first encoder was optimised like a normal CNN VAE, and the second encoder was optimised beside it without a decoder. The gradient descent-based Adam optimisation algorithm minimised each network’s loss.

The network architectures were selected based on preliminary experiments. Hyperparameter selection has become a more relevant topic in recent years because generative adversarial networks (GANs) are very sensitive to their hyperparameters [38]. VAEs, on the other hand, are much more resilient. Preliminary studies found that almost all reasonable choices of architecture hyperparameters yielded a network that converged on a solution.

Both encoders had seven layers. Convolutional layers with a stride of 2 used \(5\times 5\) pixel filters, and ones with a stride of 1 used \(3\times 3\) filters. Each encoder had 16 output neurons—where two neurons formed the parameters necessary to sample a latent vector element. The encoders received \(64\times 48\) colour images normalised between \(-1\) and 1 and generated eight-element latent vectors.

The TCNN decoder also had seven layers. Figure 4 shows that each training iteration contained two steps. The first step involved optimising the first encoder and the decoder as a regular VAE. The optimisation algorithm attempted to minimise two losses at once—the MSE loss of the decoder’s reconstruction and the KLD loss of the encoder. The balance between the losses was managed as described in [33].

The second step of each iteration was training the second encoder to match the latent vector produced by the first encoder. The first encoder received simulated images, while the second received corresponding real-world images (Fig. 10). During the second step, the weights of the first encoder remained fixed and were only used to provide the second encoder with a desired output.

The VAE (the first encoder) generated a latent vector by sampling each latent vector element from normal distributions, which were shaped by the encoder network’s output neurons. This added a small aspect of randomness to the latent vectors. If the same image were provided to an encoder twice, it would produce two similar but not identical latent vectors (Fig. 11).

If the second encoder was trained to replicate the latent vectors produced by the first encoder, the optimisation algorithm would often be chasing the randomness of the sampling process. This was eliminated by training the second encoder to match the output neurons of the first encoder, minimising the difference between the standard deviations and means of the distributions used by the encoders to sample their latent vectors, not the latent vectors themselves. Thus, the second encoder was optimised by minimising the MSE between the output neurons of the two encoders.

Fig. 11
figure 11

Three latent vectors generated from the same image using a single encoder. The vectors vary slightly, but the reconstructed images are almost identical
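The two training steps described above can be sketched as follows, reusing the sample_latent and kld_loss helpers sketched in Sect. 2.4. The module and optimiser names, the (mean, log-variance) encoder outputs and the loss weighting beta are assumptions made for illustration; opt_vae is assumed to optimise the parameters of the first encoder and the decoder, while opt_real optimises only those of the second encoder.

```python
import torch

def train_iteration(encoder_sim, encoder_real, decoder,
                    opt_vae, opt_real, sim_batch, real_batch, beta=1.0):
    """One training iteration of the two frame compressors (illustrative sketch)."""
    # Step 1: update the first encoder and the decoder as a regular VAE
    # on a minibatch of simulated images.
    mu_sim, log_var_sim = encoder_sim(sim_batch)
    recon = decoder(sample_latent(mu_sim, log_var_sim))
    loss_vae = torch.mean((recon - sim_batch) ** 2) + beta * kld_loss(mu_sim, log_var_sim)
    opt_vae.zero_grad()
    loss_vae.backward()
    opt_vae.step()

    # Step 2: update the second encoder on the corresponding real images so that
    # its distribution parameters match those of the (now fixed) first encoder.
    with torch.no_grad():
        mu_target, log_var_target = encoder_sim(sim_batch)
    mu_real, log_var_real = encoder_real(real_batch)
    loss_match = torch.mean((mu_real - mu_target) ** 2) \
        + torch.mean((log_var_real - log_var_target) ** 2)
    opt_real.zero_grad()
    loss_match.backward()
    opt_real.step()

    return loss_vae.item(), loss_match.item()
```

Note that the matching loss in step 2 is computed on the distribution parameters produced by the encoders' output neurons, not on the sampled latent vectors, for the reason explained above.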

6.4 Discussion

The frame compressors bridged the gap between the simulated visual environment and the real-world environment. Each frame compressor provided the controller with a means to interpret visual information. If the frame compressors did not produce the same latent vectors for corresponding real-world and simulated images, then controllers which performed well in simulation would perform poorly on the physical robot.

The latent vector encoding was trained unsupervised, where the NN decided what information each latent vector element would encode. This made it difficult to determine the accuracy of the frame compressors directly. Fortunately, the decoder could convert latent vectors into images, making visualisation easier. Since both frame compressors used the same latent space, the decoder could be attached to either one to assist in evaluating the accuracy of the frame compressors.

Figure 12 shows eight example patterns from the test set. With corresponding input images, both encoders generated latent vectors, which produced nearly identical reconstructed images, showing that both encoders were successfully trained.

Fig. 12
figure 12

Eight patterns from the test set, with each row being a single pattern. The simulated environment column contains images generated from the traditional visual simulator. The real environment column shows corresponding images captured with the robot’s camera. Encoder 1 receives the simulated environment images and produces latent vectors, which are then provided to the decoder, forming the first reconstruction. Encoder 2 receives the real environment images and produces latent vectors, which are then provided to the decoder, forming the second reconstruction

The second encoder and the decoder could be used together to convert the video feed from the real environment to appear like the simulated environment. Figure 13 shows several camera images converted to latent vectors by the second encoder and reconstructed by the decoder. This demonstrated what the controller would be able to perceive when running in the real environment. Smaller details such as shadows and textures were removed, but the difference between relevant objects was highlighted—such as the white wall and ground being represented as different colours.

There was a noticeable difference between the positions of the walls in the camera images and their reconstructed counterparts. This was an artefact of the error between the model’s position in the simulated environment and the real world. This difference was also visible in the training data in Fig. 10. Spending more time calibrating the traditional visual simulator to match the real world would minimise this discrepancy, but while it did degrade the controller’s perception of the environment, the difference was small enough that it could be safely ignored.

Fig. 13
figure 13

The real images captured with the camera were encoded with the second encoder into latent vectors. The decoder received the latent vectors and produced the reconstructed images

7 Controller evolution

A long short-term memory (LSTM) RNN formed the controller. It received the low-dimensional latent vectors as representations of camera images, and it sent commands to the motors. The controller’s network parameters were evolved with an EA. A simulated environment was needed to evaluate the fitness of each controller in the evolving population.

7.1 Simulated environment

NAVER only simulates the visual environment. A motion simulator is needed to calculate the robot’s position. The motion simulator was built to receive motor commands and calculate the expected changes in the robot’s position and orientation. NN-based motion simulators have been shown to provide accurate motion simulation for differentially steered robots [9], so a fully connected NN with three hidden layers of 50 neurons each was selected as a motion simulator.
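A motion simulator with this architecture could be defined as follows; the input and output features (two motor speeds in, a change in position and heading out) are assumptions based on the description above, not the exact interface used in this study.

```python
import torch.nn as nn

class MotionSimulator(nn.Module):
    """Fully connected motion simulator with three hidden layers of 50 neurons.
    Inputs/outputs are assumptions: two motor speeds in, (dx, dy, dtheta) out."""
    def __init__(self, n_inputs=2, n_outputs=3, hidden=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_inputs, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_outputs),   # predicted change in pose
        )

    def forward(self, motor_commands):
        return self.net(motor_commands)
```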

The first frame compressor, visual simulator and motion simulator formed a simulated environment for controller evaluation (Fig. 5). Evaluating a controller began with placing the virtual robot at the maze’s starting position. The traditional visual simulator received the robot’s position and produced a simple untextured image. The first frame compressor encoded the simulated image into a latent vector. The controller received the latent vector and produced the motor commands. The motion simulator processed the motor commands, and the robot’s new position was again provided to the traditional visual simulator. This process was repeated for 45 timesteps, and the controller’s fitness was evaluated based on its movements.
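The evaluation loop described above can be summarised as the following Python pseudocode. The component interfaces (render, encode, step, predict) are placeholders for the traditional visual simulator, frame compressor, controller and motion simulator, not an actual API from this study.

```python
def evaluate_controller(controller, visual_sim, frame_compressor, motion_sim,
                        start_pose, fitness_fn, timesteps=45):
    """Run one controller in the NAVER simulated environment (illustrative sketch)."""
    pose = start_pose                                 # robot position and orientation
    trajectory = [pose]
    controller.reset()                                # clear recurrent (LSTM) state
    for _ in range(timesteps):
        image = visual_sim.render(pose)               # simple untextured image
        latent = frame_compressor.encode(image)       # 8-element latent vector
        left_speed, right_speed = controller.step(latent)
        pose = motion_sim.predict(pose, left_speed, right_speed)
        trajectory.append(pose)
    return fitness_fn(trajectory)                     # score the controller's movements
```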

7.2 Controller

The controller network consisted of two LSTM layers, with a fully connected layer acting as a weighted shortcut connection over each of them. Each layer contained 100 neurons. The network had 135,202 trainable parameters. The network received an eight-dimensional latent vector at each timestep and produced a speed value for each motor. The motors were run at the set speed for 500 ms before the next latent vector was provided to the network.
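One plausible PyTorch realisation of such a controller is sketched below. The exact placement of the weighted shortcut connections and the output activation are assumptions, so the parameter count of this sketch will not match the 135,202 reported above exactly.

```python
import torch
import torch.nn as nn

class LSTMController(nn.Module):
    """Two LSTM layers of 100 neurons, each bypassed by a fully connected
    shortcut connection, producing two motor speeds (illustrative sketch)."""
    def __init__(self, n_inputs=8, hidden=100):
        super().__init__()
        self.lstm1 = nn.LSTM(n_inputs, hidden, batch_first=True)
        self.short1 = nn.Linear(n_inputs, hidden)      # shortcut over layer 1
        self.lstm2 = nn.LSTM(hidden, hidden, batch_first=True)
        self.short2 = nn.Linear(hidden, hidden)        # shortcut over layer 2
        self.out = nn.Linear(hidden, 2)                # left and right motor speeds

    def forward(self, latent_seq):
        # latent_seq has shape (batch, timesteps, 8): one latent vector per timestep.
        h1, _ = self.lstm1(latent_seq)
        h1 = h1 + self.short1(latent_seq)
        h2, _ = self.lstm2(h1)
        h2 = h2 + self.short2(h1)
        return torch.tanh(self.out(h2))                # assumed output activation
```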

With NVS, latent vectors were limited to eight dimensions to keep the task of the latent vector simulator as simple as possible. Since NAVER does not use a latent vector simulator, the latent vector dimension was initially increased to 32 to improve the accuracy of the frame compressor’s image representation. However, a preliminary test revealed that 32-dimensional latent vectors were too complex for the controller NN to interpret, preventing the controllers from evolving in the simulated environment, so the latent vectors were reverted to eight dimensions again.

7.3 Evolution process

7.3.1 Population initialisation

A population of 100 candidate networks was randomly initialised, where each candidate consisted of a complete set of network parameters. Every candidate was initialised by sampling weights from a normal distribution with a mean of 0 and a standard deviation of 0.2.

7.3.2 Fitness evaluation

Each NN candidate was executed in the simulated environment four times, with one execution for each corner. The controller would be expected to direct the virtual robot forward to the first T-junction at the maze’s centre, then turn left or right based on the sign, then forward to the next T-junction, followed by another left or right turn before approaching the designated corner.

Controllers evolved poorly when evaluated over the full 45 timesteps from the outset. Incremental learning remedied this problem. At first, each controller was only executed for 33 timesteps—enough to reach the second T-junction if it travelled near perfectly. Evolving the population with only 33 timesteps gave the controllers time to master navigating past the first T-junction without being concerned with the second one. After the average fitness of the whole population surpassed a threshold which indicated that most of the controllers had successfully navigated past the first T-junction, the number of timesteps was increased to 45, allowing the controllers to pass the second T-junction and reach a corner.
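Expressed as code, the incremental extension of the evaluation horizon amounts to a simple check in the generation loop. The fitness threshold value and the helper callables are assumptions made for illustration.

```python
def evolve_with_incremental_timesteps(population, evaluate, next_generation,
                                      max_generations, threshold):
    """Incremental learning sketch: evaluate with 33 timesteps until the
    population's average fitness passes a threshold, then switch to 45."""
    timesteps = 33                      # enough to reach the second T-junction
    for _ in range(max_generations):
        fitnesses = [evaluate(controller, timesteps) for controller in population]
        if timesteps == 33 and sum(fitnesses) / len(fitnesses) > threshold:
            timesteps = 45              # now controllers can reach a corner
        population = next_generation(population, fitnesses)
    return population
```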

The fitness function incentivised a controller by rewarding it if it moved towards the target location and penalising it for moving away. The first target location would be set at the first T-junction. Once the simulated robot was within a certain distance from the target, the target was moved to the next T-junction on its path. After reaching the second T-junction, the target was moved to the designated corner. This process guided the evolving controllers through the maze.

For each controller execution, if the robot followed the signs’ directions, it would reach its target. The controllers were not explicitly programmed to interpret signs, but rather the fitness function rewarded controllers which obeyed signs correctly. Interpretation of the signs was essential as they were the only indication of the correct corner, and moving to an incorrect corner could result in a fitness score of zero. The fitness function also applied a penalty if the controller collided with the walls. The fitness function pseudo code is listed in Algorithm 1. The function OverBorder measures how much the simulated robot strayed into one of the walls (i.e. collided with the wall).

Algorithm 1 Fitness function pseudo code
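The following sketch is reconstructed only from the description above and is not a reproduction of Algorithm 1; the reward scale, the distance at which the target is advanced and the collision penalty are assumptions, and positions are treated as (x, y) coordinates.

```python
import math

def fitness(trajectory, waypoints, over_border,
            switch_distance=0.3, collision_penalty=10.0):
    """Illustrative fitness sketch: reward progress towards a moving target,
    penalise collisions, and advance the target through the waypoints
    (first T-junction, second T-junction, goal corner)."""
    target_index = 0
    score = 0.0
    for previous, current in zip(trajectory, trajectory[1:]):
        target = waypoints[target_index]
        # Reward moving towards the current target, penalise moving away.
        score += math.dist(previous, target) - math.dist(current, target)
        # Penalise straying into a wall, as measured by over_border().
        score -= collision_penalty * over_border(current)
        # Move the target once the robot is close enough to it.
        if math.dist(current, target) < switch_distance and target_index < len(waypoints) - 1:
            target_index += 1
    return max(score, 0.0)
```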

7.3.3 Selection

Tournament selection was used, dividing the population into 11 groups of around nine controllers each. The fittest controller of each group was selected for crossover, while the rest were discarded.

7.3.4 Crossover

Each controller was paired with every other controller. Each pair would serve as a set of parent controllers, from which two new child controllers were formed. Simulated binary crossover was used.
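A standard implementation of simulated binary crossover over flat parameter vectors is sketched below; the distribution index eta is an assumption, as its value is not reported here.

```python
import random

def simulated_binary_crossover(parent1, parent2, eta=15.0):
    """Simulated binary crossover (SBX) applied element-wise to two flat
    parameter vectors, producing two children."""
    child1, child2 = [], []
    for p1, p2 in zip(parent1, parent2):
        u = random.random()
        if u <= 0.5:
            beta = (2.0 * u) ** (1.0 / (eta + 1.0))
        else:
            beta = (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta + 1.0))
        child1.append(0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2))
        child2.append(0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2))
    return child1, child2
```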

7.3.5 Mutation

After the new population of child controllers was created, they were mutated. For the mutation procedure, each controller parameter had a 20% chance of mutation, and parameters were mutated by adding a value sampled from a normal distribution with a mean of 0 and standard deviation of 0.02.
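A direct implementation of this mutation operator, operating on a flat list of controller parameters, is:

```python
import random

def mutate(parameters, rate=0.2, sigma=0.02):
    """Mutate each parameter with probability `rate` by adding a value sampled
    from a normal distribution with mean 0 and standard deviation `sigma`."""
    return [p + random.gauss(0.0, sigma) if random.random() < rate else p
            for p in parameters]
```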

7.4 Computational performance

The VAE encoder is a NN with just over 2 million trainable parameters (which is very small for a modern deep NN), and the controller NN has just over 135,000 trainable parameters (Sect. 7.2). The relatively small size of the networks means that the algorithm can easily be executed in real time on a robot equipped with a single-board processor, such as a Raspberry Pi. During training of the controller in simulation, a run which would take one minute in the real world could be completed in a fraction of a second.

8 Results

After the population was evolved, the controller with the highest fitness value was selected for testing on the physical robot. Signs were set up to direct the robot to a particular corner, and the robot completed 30 executions, where an execution would be considered successful if the robot reached the target corner without heading in the wrong direction. The signs were then changed to direct the robot to a different corner, and another 30 executions were performed. This process was repeated for each corner, totalling 120 executions.

Encouraging results were obtained. A successful controller execution was randomly selected from the set of executions for demonstration purposes and deeper analysis (Sect. 8.1). Section 8.2 follows with a summary of the overall performance over all the executions.

8.1 Single controller execution

Figure 14 shows an overhead view of a single controller execution on the physical robot. The controller was tasked with reaching the top-left corner, meaning it had to turn left at the first intersection and right at the second.

Fig. 14 Overhead view of the robot for a real-world controller execution. Figure 15 shows the corresponding onboard video feed. The robot began at the starting position at timestep 0 and ended at the top-left corner at timestep 60. The white arrow shows the direction the robot was facing at each timestep (colour figure online)

Figure 15 shows the corresponding onboard camera feed for the execution. The 'real' row shows the raw camera feed. These images were encoded into latent vectors by the second frame compressor, and the latent vectors were then provided to the controller. The images in the 'reconstruction' row were produced by providing those latent vectors to the decoder. The reconstruction row offers a visual depiction of what information was stored in the latent vectors and hence what visual information was available to the controller.
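
The per-timestep pipeline described here can be summarised by the following sketch, in which encoder, controller and decoder are hypothetical stand-ins for the trained frame compressor, the evolved controller and the decoder used for visualisation.

```python
def control_step(frame, encoder, controller, decoder=None):
    # One timestep of the deployed pipeline: the raw camera frame is compressed
    # to a latent vector by the (second) frame compressor, the evolved
    # controller maps the latent vector to motor commands, and the decoder
    # optionally reconstructs the frame purely for visualising what the
    # controller 'sees'.  All three callables stand in for trained networks.
    latent = encoder(frame)
    motor_commands = controller(latent)
    reconstruction = decoder(latent) if decoder is not None else None
    return motor_commands, reconstruction
```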

Fig. 15 The onboard camera view from a real-world execution. Camera images were converted into latent vectors and provided to the controller. The reconstructed images were the decoder's interpretation of the latent vectors, showing what the controller could see

The controller travelled forward from the starting position for the first 15 timesteps. The blue sign was clearly visible at timestep 10, indicating that the robot should turn left. It is correctly shown as dark cyan in the reconstructed image, as that was the corresponding colour used in the traditional visual simulator. This indicates that the frame compressor was encoding the camera images correctly.

When turning left at the blue sign, the robot turned slightly too far anticlockwise, meaning it was not pointed precisely straight down the passage. This is visible at timestep 20, where, had the robot continued forward, it would have collided with the passage wall. It over-corrected shortly afterwards, and by timestep 30 it had the opposite problem and was heading towards the other passage wall. After that, it corrected itself again with a slight anticlockwise turn, and at timestep 40 it was heading straight down the passage. This series of corrections indicates that the controller made ample use of the camera, repeatedly adjusting its heading to avoid colliding with the walls.

At timestep 40, the orange sign was clearly in view. The sign appeared bright green in the reconstructed image in Fig. 15, meaning that the frame compressor correctly conveyed the sign to the controller. The controller turned right and then continued down the passage, ending the execution once it successfully reached the top-left corner.

8.2 All controller executions

Table 2 lists the results of the experiment. The performance of the controllers in the maze is measured over three tiers. The first tier includes all controllers which successfully travelled down the first passage and turned correctly at the first T-junction. The robot successfully completed this first section of the maze in all 120 controller executions. Sixty of these required the robot to turn right at an orange sign, and the other 60 required the robot to turn left at a blue sign.

The second tier includes all controller executions which successfully passed the first T-junction, then travelled down the second corridor and turned correctly at the second T-junction. Ninety-three of the 120 controller executions (77.5%) successfully completed this portion of the maze. It was common for the edge of the robot to nick the outer corners of the walls as it rounded the second T-junction. This may have resulted from the positional inaccuracy between the traditional simulator and the real-world environment. As visible in Fig. 10, the position of the walls in the real environment did not precisely match the walls in the simulated environment. This could have led the controller to believe that the robot could perform tighter turns than was physically possible given the size of its chassis. It may have been possible to amend this issue by having the fitness function treat the passages as narrower than they actually are when deciding whether to penalise controllers for colliding with the walls, which would encourage the controllers to remain in the centre of the passages. Touching the wall at the second T-junction caused the robot to become hooked on the wall, and the extra source of friction significantly changed how the robot moved when executing motor commands. Despite this unexpected event (which was impossible in simulation), the robot was able to use visual information to recover every time, unhooking itself from the wall and continuing down the passage.

The third tier includes all controller executions where the robot reached the correct corner. This required that the controller turned correctly at both T-junctions and travelled down the final passage until reaching the corner. Eighty-two of the 120 controller executions (68.3%) achieved this, successfully reaching their goal.

Table 2 Controller performance on real robot

Figures 16 and 17 show the robot paths for all 120 controller executions. Initially, the paths remained tightly clustered, with only slight variation between them. After the first T-junction, a wobble is often visible in the paths, caused by controllers over-correcting to avoid the walls while trying to travel straight down the passage; this wobble caused the paths to fan out. The resulting variation in the approach to the second T-junction led a few controller executions to take the robot in the wrong direction.

Fig. 16 a All 30 robot paths where the target is the top-left corner. Paths are shown in blue at the beginning of the execution, green in the middle and red at the end. b All 30 robot paths where the target is the top-right corner. The coloured squares are the signs; the robot is expected to turn left at a blue sign and right at an orange sign. In all executions, the robot turned in the correct direction at the first T-junction, but in a few executions the robot subsequently deviated from the correct path, travelling to a bottom corner instead (colour figure online)

Fig. 17 a All 30 robot paths where the target is the bottom-left corner. Paths are shown in blue at the beginning of the execution, green in the middle and red at the end. b All 30 robot paths where the target is the bottom-right corner (colour figure online)

General patterns can be observed by comparing how the paths vary between executions for the same corner and between executions for different corners. Along the first passage, all robot paths headed in the same direction and remained clustered together until reaching the first T-junction. At that point, the paths split to turn either left or right. The controllers began their turns before crossing over into the next passage; a tighter turn around the junction allowed the robot to reach the target corner sooner. This early turning is likely a result of the emphasis that the fitness function places on how quickly the controller reaches the target corner.

Overall, the controller can be described as successful. The experimental task was very challenging to complete with visual information alone. Unlike previous research, this controller had no distance sensors and had to rely on an onboard camera with a limited field of view. The turning signs were visible during most of the controller’s execution, but the controller had to wait until it was in position (when the sign was no longer visible) before performing a turn. The same controller was expected to be able to travel to any of the four corners depending on the visual information it received. The robot’s chassis was half the size of the passages, meaning the controller had little room for error. The robot navigated the maze using visual information alone, avoiding the narrow walls and following the directions of the signs.

9 Limitations of NAVER

VAEs achieve such impressive compression only because they encode and decode images from a small domain, in this case a maze with limited textures. VAEs can be trained to operate in more complex real-world environments that contain many more textures; however, representing those textures in the latent encoding may require a higher-dimensional latent vector. Preliminary experiments showed that evolved controllers struggled to interpret higher-dimensional latent vectors. This means that while a controller evolved with NAVER could operate in environments with many real-world textures, the number of individual textures that it can differentiate from one another is limited. For example, the VAE could be trained to assign many different real-world textures to a single simulated wall colour, and the controller could successfully avoid all such walls. However, if there were many different simulated wall colours and the controller was expected to behave differently depending on which wall texture it encountered, the number of textures it could differentiate would be limited.

In summary, the primary limiting factor for the VAE is the size of the latent vector, which restricts the number of simulated colours that the controller can differentiate, not the number of real-world textures associated with those colours. The controller's behavioural complexity will therefore be limited, but there is no foreseeable restriction on the number of real-world textures it can interpret.

A further limitation of NAVER is that it can only render objects seen during the frame compressor training phase. The algorithm may fail in situations where it encounters unexpected objects.

10 Conclusion

NAVER was shown to provide a solution to the two main challenges associated with vision-based controllers in ER. First, the frame compressors reduce the dimensionality of the visual information provided to the controller, allowing the controller to interpret the information without significantly increasing the EA's search space. Second, the frame compressors work together to bridge the reality gap between the traditional visual simulator and the real environment. This allows the traditional visual simulator to be significantly simplified, removing the need for textures and lighting and making it much less time-consuming to construct.

NAVER made possible the evolution of a controller to solve a very challenging task: navigating through a double T-maze and making turns based on visual cues to reach a target corner. The controller was evolved in simulation and transferred to the physical robot, where it successfully performed the same navigation tasks.

The controller was able to navigate the maze using only visual information. The controller reacted to visual cues and changed its behaviour accordingly. NAVER was able to bridge the reality gap between the simulated environment and real environment using two frame compressors. In addition, the frame compressors enabled the controller to interpret the available visual information easily.

Future work will investigate the scalability of the NAVER approach to more complex environments and tasks. Of particular interest would be environments of such high complexity that latent vectors of higher dimensions would be required to represent an image. It remains to be determined if the ER process would be able to evolve controllers with a large number of visual inputs.

Depth cameras provide information about the distances to objects in the environment, and much research has recently been conducted on their use in computer vision [39]. Future work could incorporate depth camera data into the NAVER approach to determine whether controller performance can be improved.