We evaluate our classifiers using two sets of test data: a set of images partially labelled in the same manner as our training data, and a smaller set that is fully labelled (i.e. every pixel in the image is labelled). The output layer of the CNN assigns a label to every pixel, while the SVM outputs a label for each segment. Figure 5 shows some example images with their respective CNN outputs and ground truth annotations.
Due to the lack of clearly defined boundaries in some areas of off-road scenes, some pixels could legitimately be given more than one correct label. This should not have much effect on the partially labelled data, as boundary regions remain largely unlabelled; however, it is likely to have a negative effect on classification results when testing against fully labelled data. To limit this effect, when deciding whether a pixel is correctly labelled we search for a match within a 5 pixel radius in the ground truth image. When testing with partially labelled data, a pixel is only labelled correctly if a match is found at its exact location in the ground truth image.
When discussing the CNN, unless stated otherwise, accuracy is defined as the number of correctly labelled pixels divided by the total number of labelled pixels in the test data. When discussing the SVM, accuracy is defined as the number of correctly labelled segments divided by the total number of labelled segments in the test data.
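As an illustration, the sketch below (Python with NumPy and SciPy) computes pixel accuracy with the radius tolerance described above by dilating the per-class ground-truth masks. It is a minimal sketch rather than our actual evaluation code; the function names and the use of 255 as the "unlabelled" value are assumptions for the example.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def tolerant_pixel_accuracy(pred, gt, radius=5, ignore_label=255):
    """Pixel accuracy where a prediction counts as correct if a pixel with
    the same label exists within `radius` pixels of it in the ground truth.
    `pred` and `gt` are 2-D integer label maps of the same shape;
    `ignore_label` marks unlabelled pixels (partially labelled data)."""
    # Disc-shaped structuring element of the given radius.
    yy, xx = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    disc = (yy ** 2 + xx ** 2) <= radius ** 2

    labelled = gt != ignore_label
    correct = np.zeros_like(labelled)
    for c in np.unique(gt[labelled]):
        # A predicted pixel of class c is correct if class c appears
        # anywhere within the disc around it in the ground truth.
        gt_c_dilated = binary_dilation(gt == c, structure=disc)
        correct |= (pred == c) & gt_c_dilated
    return (correct & labelled).sum() / labelled.sum()
```

Setting radius to 0 recovers the exact-match criterion used for the partially labelled test data.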
3.1 CNN with Partially Labelled Test Data
First we compare classification accuracy when the network is trained on our full off-road data set, and on smaller subsets thereof, after different amounts of pre-training, testing in each case against our partially labelled test data set.
Pre-training Iterations.
Table 1 shows the performance of the network on Camvid test data before any training with off-road data, with accuracy recorded at the six points from which transfer learning was to be performed. As the Camvid data is mostly fully labelled, we use the same measure of accuracy as we use with our fully labelled off-road test data set, wherein a label is deemed to be correct if it is within a 5 pixel radius of a similarly labelled pixel in the ground truth image.
Table 1. Accuracy of the CNN on the Camvid test data at the points when snapshots are taken to perform transfer learning
These results show that the network improves rapidly over its first 10,000 training iterations, followed by slower but consistent improvement during later iterations.
Figure 6 shows the results achieved by each pre-trained version of the network on our full data set. Each version of the network was trained for 10,000 iterations, with a snapshot taken and accuracy recorded first at every 100 iterations, then at every 1000 iterations.
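The transfer-learning procedure itself is straightforward: each off-road network is initialised either randomly or from one of the Camvid snapshots, then fine-tuned on the off-road data with accuracy recorded on the schedule described above. The sketch below outlines this in PyTorch purely for illustration; our networks were not trained with this code, and the optimiser, learning rate and unlabelled-pixel index shown are placeholder assumptions.

```python
import torch

def fine_tune_from_snapshot(model, snapshot_path, train_loader, evaluate,
                            total_iters=10_000, lr=1e-3):
    """Fine-tune a segmentation network from a pre-trained snapshot
    (or from random weights if snapshot_path is None), recording accuracy
    every 100 iterations for the first 1,000 and every 1,000 thereafter."""
    if snapshot_path is not None:
        model.load_state_dict(torch.load(snapshot_path))
    optimiser = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = torch.nn.CrossEntropyLoss(ignore_index=255)  # skip unlabelled pixels

    history, it = [], 0
    while it < total_iters:
        for images, labels in train_loader:
            optimiser.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimiser.step()
            it += 1
            checkpoint_every = 100 if it <= 1000 else 1000
            if it % checkpoint_every == 0:
                # `evaluate` is assumed to run the test set and return accuracy.
                history.append((it, evaluate(model)))
            if it >= total_iters:
                break
    return history
```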
The results show that the first few thousand iterations clearly benefit from transfer learning, with the networks that have performed a greater amount of pre-training generally performing better. However, by 5000 iterations of training, even the network initialised with random weights has achieved an accuracy of close to 0.9, beyond which there is very little improvement from any of the networks.
As the training continues, the networks pre-trained for longer give marginally better results. The highest accuracy achieved is 0.917, which comes after 8000 iterations of the network that was pre-trained for 30,000 iterations. The networks pre-trained for 20,000 and 30,000 iterations show very similar results throughout the training, suggesting a limit to the performance gains that can be achieved by pre-training.
Training was continued up to 20,000 iterations with each network; however, this gave no further increase in accuracy, so only the first 10,000 iterations are shown.
It is interesting to note that, within a few hundred iterations, each network surpasses the accuracy it achieved on the Camvid test data, and goes on to perform significantly better. This could partly result from our data set containing fewer classes (8 vs 11). Another factor could be our partially labelled test data, which features very few class boundary regions; however, further testing with fully labelled data shows similar performance. It is possible that partially labelled training data could lead to a better performing classifier due to the lack of potentially confusing boundary pixels, although to fully test this we would need to compare these results to those obtained by training an identical network with a fully labelled version of the same data set, which is beyond the scope of this paper.
Data Set Size.
To consider the effect the amount of training data used has on classification, we train networks using five different sized subsets of our training data, containing 140, 70, 35, 17 and 8 images, both with and without pre-training. Figure 7 compares results for three of these subsets, each trained for 10,000 iterations.
The effects of transfer learning are similar: for the first 1000 iterations, the benefits of pre-training are clear; however, after just a few thousand more, both pre-trained and un-pre-trained networks have achieved close to their optimum performance. As training progresses, the pre-trained network consistently outperforms the non-pre-trained network by a small margin, which generally increases as the data set size decreases: after 10,000 iterations with a data set of 140 images, the accuracy of the pre-trained network is just 0.01 better than that of the un-pre-trained network, while with the data set of 8 images this margin increases to 0.09.
Per Class Results.
We now discuss in more detail the results from the CNN trained for 10,000 iterations on the full data set after 30,000 iterations of pre-training. This is the network configuration that we would expect to typically perform best, with the highest amount of pre-training and largest data set, and it consistently achieves an accuracy of 0.91 against our partially labelled test data once it has passed 5000 iterations. Figure 8 shows the proportion of pixels belonging to each class that were given each possible incorrect label.
The most common misclassifications are between grass, foliage and trees, which is understandable given their visual similarities. Proportionally to class size, the largest is the 20.5% of man-made obstacle pixels that are misclassified as tree. This is likely because many of the man-made obstacles in the off-road environment, such as fences, posts and gates, are made of wood and so have a similar appearance to trees.
Figure 9 plots the precision and recall of each class along with the proportion of the training data set that each class makes up. The foliage class performed worst, likely due to its visual similarity to both grass and trees, while sky gave the best results. Camera exposure was set to capture maximum detail at ground level, so in most instances the sky is much brighter than the rest of the scene, which combined with its lack of high frequency detail and consistent placement at the top of an image makes it easily distinguishable from other classes.
For the most part, classes that achieve high precision also achieve high recall, however man-made obstacle is an exception, with a very high precision (0.92) but lowest overall recall (0.613), meaning very few pixels are misclassified as man-made obstacle, while many pixels which should be labelled man-made obstacle are not. The fact that it is the class with fewest training samples (594,125 pixels) is likely to have played a part in this, as well as its visual similarity to trees, as discussed above.
There would appear to be some correlation between the frequency of a class within the data set and its recall, possibly because of the way the output is weighted towards classes that appear more often.
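For reference, the per-class figures discussed above can be derived from a pixel-wise confusion matrix. The following is a minimal sketch, not our evaluation code:

```python
import numpy as np

def per_class_precision_recall(conf):
    """Per-class precision and recall from a confusion matrix `conf`,
    where conf[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(conf).astype(float)
    precision = tp / np.maximum(conf.sum(axis=0), 1)  # column sums: predicted totals
    recall = tp / np.maximum(conf.sum(axis=1), 1)     # row sums: ground-truth totals
    return precision, recall
```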
3.2 Fully Labelled Test Images
So far we have only discussed results obtained by testing the CNN classifier against partially labelled data; we therefore also test it against a set of fully annotated images to demonstrate that it achieves similar results.
Figure 10 shows the results obtained, and demonstrates that testing with fully labelled images yields results very similar to those of the partially labelled set. The highest accuracy seen was with the network pre-trained for 5000 iterations, which reached an accuracy of 0.924 after 8000 iterations of training with the full off-road data set.
Interestingly, the network snapshots that perform poorly on the partially labelled set (i.e. those that have not yet been through enough training iterations or have only been trained on a small data set) tend to perform worse on the fully labelled images. By contrast, those that perform well on the partially labelled data exhibit less deterioration, and in some cases even demonstrate an improvement in accuracy, when the fully labelled set is used. This would appear to suggest that a more comprehensively trained network performs much better in class boundary regions.
Another point of note is that with the partially labelled data set, a network that had undergone more pre-training would almost always perform better. When testing with the fully labelled data set, however, the networks pre-trained for 5000 and 10,000 iterations consistently outperform those pre-trained for 20,000 and 30,000 iterations at the later stages of training, although only by a very small margin. This could be because the networks that have undergone more pre-training begin to overfit to the data they were originally trained on. The fact that this only occurs when the fully labelled data is used might suggest that this overfitting only has a noticeable effect when classifying class boundary regions, which are not present in the partially labelled data.
3.3 SVM
For comparison, we test the SVM approach on its ability to classify segments from our off-road data set. The SVM parameters are automatically optimised through cross-validation, however we test several different configurations for the features that we pass into the classifier. The parameters that we alter are g, the number of pixels between feature points in our grid, r, the radius in pixels around each feature point that our descriptors take account of, and K, the number of clusters used to build our bag-of-words. Figure 11 shows several comparisons to demonstrate how performance is affected.
We would expect a decrease in g to improve results, as a greater amount of detail is being considered. This partly holds true in our results, although not consistently so. As r increases, we initially see a consistent improvement in results, which tails off for larger radii. This is likely because with r set too small, each feature point only has access to a limited region of local gradient information, while with r set too large, high frequency detail is lost as the descriptor is built from a greater number of pixels. The optimum value appears to be around r = 10. The general trend for K is that larger is better, but memory and time constraints make too large a value impractical.
The best result attained by the SVM was an accuracy of 0.813, using the parameters g = 6, r = 12, K = 1400.
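The pipeline behind these parameters can be summarised as follows: dense descriptors are computed on a grid with spacing g and support radius r, clustered into a K-word vocabulary, and each segment is represented by its visual-word histogram before being passed to a cross-validated SVM. The sketch below is illustrative only; SIFT descriptors, an RBF kernel and the small parameter grid are stand-ins where the text above does not fix a specific choice.

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def grid_descriptors(gray, g, r):
    """Dense descriptors on a regular grid: one keypoint every g pixels,
    each covering a neighbourhood of radius r (SIFT used as a stand-in)."""
    sift = cv2.SIFT_create()
    h, w = gray.shape
    kps = [cv2.KeyPoint(float(x), float(y), 2.0 * r)
           for y in range(r, h - r, g) for x in range(r, w - r, g)]
    kps, desc = sift.compute(gray, kps)
    return np.array([kp.pt for kp in kps]), desc

def build_codebook(desc_per_image, K):
    """Cluster all training descriptors into K visual words."""
    return KMeans(n_clusters=K, n_init=4).fit(np.vstack(desc_per_image))

def segment_histogram(points, desc, segment_mask, codebook, K):
    """Normalised visual-word histogram for one segment."""
    inside = segment_mask[points[:, 1].astype(int), points[:, 0].astype(int)]
    if not inside.any():
        return np.zeros(K)
    words = codebook.predict(desc[inside])
    hist = np.bincount(words, minlength=K).astype(float)
    return hist / hist.sum()

def train_svm(X, y):
    """Cross-validated SVM over the segment histograms (features X, labels y)."""
    grid = GridSearchCV(SVC(kernel='rbf'),
                        {'C': [1, 10, 100], 'gamma': ['scale', 0.01, 0.001]},
                        cv=5)
    return grid.fit(X, y).best_estimator_
```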
To properly compare SVM and CNN performance, we adapted our CNN classifier to label whole segments. This was done by winner-takes-all vote of pixel labels within the segment. Figure 12 shows the segment classification results as the CNN is trained after different amounts of pre-training. The CNNs pre-trained for 10,000 or more iterations all achieve results better than those of the SVM before 1000 iterations, and by 2000 iterations all, including the CNN that has undergone no pre-training, have surpassed the SVM. After further training, segment classification results are very similar to those for pixel classification, peaking at around 0.91, confirming that the CNN is significantly more effective at this classification task than the SVM.
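The winner-takes-all vote is simple to express. A minimal sketch, assuming a per-pixel label map from the CNN and a per-pixel segment-id map of the same shape:

```python
import numpy as np

def segment_labels_from_pixels(pixel_labels, segments):
    """Assign each segment the most frequent (winner-takes-all) pixel label
    predicted by the CNN. `segments` holds an integer segment id per pixel."""
    out = {}
    for seg_id in np.unique(segments):
        votes = np.bincount(pixel_labels[segments == seg_id])
        out[seg_id] = int(np.argmax(votes))
    return out
```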