Traffic signal detection and classification in street views using an attention model

Detecting small objects is a challenging task. We focus on a special case: the detection and classification of traffic signals in street views. We present a novel framework that utilizes a visual attention model to make detection more efficient, without loss of accuracy, and which generalizes. The attention model is designed to generate a small set of candidate regions at a suitable scale so that small targets can be better located and classified. In order to evaluate our method in the context of traffic signal detection, we have built a traffic light benchmark with over 15,000 traffic light instances, based on Tencent street view panoramas. We have tested our method both on the dataset we have built and the Tsinghua–Tencent 100K (TT100K) traffic sign benchmark. Experiments show that our method has superior detection performance and is quicker than the general faster RCNN object detection framework on both datasets. It is competitive with state-of-the-art specialist traffic sign detectors on TT100K, but is an order of magnitude faster. To show generality, we tested it on the LISA dataset without tuning, and obtained an average precision in excess of 90%.


Introduction
Object detection and classification are important tasks in computer vision. The task is especially challenging when target objects are relatively small and are surrounded by a high degree of background clutter, as is the case for traffic signals in street views. For example, images of a street can be captured at high resolution (e.g., 2048 × 2048 pixels), but vital information, such as traffic signs and traffic lights, is often contained in very small regions (e.g., 50 × 50 pixels). With recent developments in autonomous driving, advanced driver assistance systems, and intelligent vehicles, visual content captured by vehicle mounted equipment plays an increasingly important role in perception of the environment and navigation guidance (see Fig. 1).
Traffic sign and traffic light detection and classification have attracted much study. Any algorithm for road sign detection must be reliable, fast, and general, and should produce results early. It must be reliable so that signs are robustly detected, fast so that other decisions such as sign recognition can take place, general to account for differences in signs between countries, and early so that signs are determined sufficiently distant from the vehicle to allow a safe response time.
Most previous works utilize color, texture, and geometric features as input to machine learning methods such as SVM and tree classifiers to distinguish either targets from background, or different classes of targets. As convolutional neural networks (CNNs) have been found to have superior performance for object classification and detection, they are extensively used in this area. CNN-based methods (e.g., faster RCNN [1] and SSD [2]) with state-of-the-art performance on PASCAL VOC and ImageNet ILSVRC datasets focus on large scale objects in images, whereas traffic signs should be detected early, i.e., at small scale.
CNNs have been built specifically for traffic sign recognition and detection [3]. Jin et al. [4] achieved a recognition rate of 99.65% on the German Traffic Sign Recognition Benchmark with CNNs. Zhu et al. [5] published a more practical traffic sign dataset, the Tsinghua-Tencent 100K (TT100K) dataset, and proposed a CNN-based method that performs better than fast RCNN [6]. Although the method proposed by Zhu et al. achieves high recall and accuracy, it uses a CNN to scan the whole high resolution image at different scales, which is time-consuming. We propose a more efficient system that utilizes a visual attention model to reduce computation in background areas. We show results of testing it on traffic sign and traffic light detection and classification tasks.
Our CNN approach exhibits all four of the properties needed for traffic signal detection: it is reliable, fast, able to detect signals at a wide variety of scales, and generalises to new datasets obtained in a country different to the training set.Its key technical contribution is use of a visual attention model.Studies in neuroscience [7] suggest that instead of forming a coherent, richly detailed representation of all the objects in the scene, the human visual system tends to focus attention on one object at a time.The perception that all objects are represented in detail simultaneously is a subjective construction enabled by coordinating attention in a few areas deemed to be salient.In this way, perception requires far less processing and memory resources.Similarly, when detecting objects in images, we can design algorithms that focus only on certain regions instead of processing the whole image at high resolution.We use a two-stage framework to accomplish the process.The first stage, based on an attention proposal model, is trained on a low resolution version of the scene to propose attention regions: those regions which should be examined in detail.The second stage, the accurate localization and recognition network, detects and classifies targets within the attention regions, at full resolution.Our framework greatly reduces computational and memory resources since it avoids computing detailed representations of most of the background.Moreover, it avoids processing at multiple scales since the regions proposed by the first stage implicitly contain scale information.
We have evaluated our system for traffic sign and traffic light detection and classification tasks using three datasets: the Tsinghua-Tencent 100K (TT100K) dataset; our own purpose-built dataset, the Tsinghua-Tencent traffic light (TTTL) dataset, which is based on Tencent street views; and the LISA dataset [8], used to test generalization. The TTTL dataset contains over 16,000 high resolution images covering various driving scenes. Our system achieved 86.6% mAP (mean average precision) on the dataset, performing better and more reliably than faster RCNN. For the TT100K dataset, an mAP of 87.0% is achieved, which is close to that of the state-of-the-art method proposed by Zhu et al. [5], but our algorithm is an order of magnitude faster than theirs. It generalizes well, without tuning, to the LISA dataset, which was constructed in the USA.
Our main contributions are as follows:
• A novel attention-model based two-stage detection framework. The framework is designed to detect small targets in large high resolution images. The attention model makes our method efficient as it processes images at low resolution and generates a small set of candidate regions. It also makes detection more accurate since the small targets occupy a large portion of the area of attention regions.
• A new street view traffic light benchmark, TTTL. To the best of our knowledge, there are no readily available benchmarks for traffic light detection other than the LISA traffic light dataset [8]. We annotated around 10K Tencent street view panorama images and built a dataset with more than 15K traffic light instances. They cover a wide range of street scenes and lighting conditions, and exhibit traffic lights in forms not present in LISA.
The remainder of the paper explains our contribution in greater detail.

Traffic light and traffic sign detection and classification
Traffic signals such as traffic lights and traffic signs provide important information for driving, and many algorithms have been developed to detect and recognize them.Diaz et al. [9] presented an exhaustive survey of current techniques for such purposes.
Traffic signal detection and recognition methods since about 2007 have been based on color segmentation, shape, and texture features in conjunction with SVM classifiers [10]. Illumination conditions have been studied by Jang et al. [11] and De Charette and Nashashibi [12], whose detector was shape based. Slightly later, shape was also used by Cai et al. [13] to recognize arrow traffic lights. More recently color segmentation has been used, e.g., by Sooksatra and Kondo [14]. Ji et al. [15] proposed a visual selective attention model based on color to construct salience maps; traffic lights are then detected using an SVM classifier with HOG features. This is similar in spirit to our work but different in practice. Some prior art utilizes digital maps and GPS information to improve the efficiency and accuracy of detection [16,17]. However, prior information is not always accessible and is not in principle necessary (humans make no use of such data).
The widespread adoption of convolutional neural networks (CNNs) has seen their application to traffic signal detection.John et al. [17,18] showed that CNNs are effective as classifiers of traffic lights, but they used traditional methods for saliency map generation.In 2016, Zhu et al. [5] published the Tsinghua-Tencent 100K dataset for traffic sign benchmark, and developed an end-to-end CNN for both detection and classification.Their approach provides excellent reliability and operates at a range of scales, but has to scan high resolution images at different scales, which impedes its efficiency.

CNN-based detection
In the past few years, a wide variety of CNN-based approaches have been developed for object detection. Sermanet et al. [19] proposed the Overfeat framework by sliding a fully convolutional network over an input image to produce classification and bounding box regression results. Zhu et al. [5] adopted this framework for traffic sign detection and classification. Girshick et al. [20] combined region proposal algorithms and CNN as a feature extractor to perform the detection task, in an approach known as RCNN. To avoid redundant computations on overlapping region proposals in RCNN, spatial pyramid pooling (SPPNet) [21] and ROI pooling (fast RCNN) [6] have since been developed. The input image is fed forward into convolutional layers only once and ROI pooling extracts fixed length feature vectors from the feature maps. This greatly accelerates training and testing. Faster RCNN, proposed by Ren et al. [1], further improves the framework by replacing the selective search [22] by a region proposal network (RPN), which shares its convolutional layers with the classification and regression network. The RPN generates fewer proposals yet achieves state-of-the-art results on PASCAL VOC 2007, 2012, and MS COCO datasets. Single-shot approaches have also been proposed, such as YOLO [23] and SSD [2]. They leave out region proposal production and ROI pooling, and directly conduct box regression on feature maps. While this makes them even faster, YOLO's accuracy falls below that of fast RCNN and faster RCNN.
All of these CNN-based detection frameworks were designed for large scale objects. Directly applying them to detect extremely small objects in high resolution images is inefficient, inaccurate, or both.

Visual attention model
Visual attention models are inspired by studies of attention mechanisms of human visual system [7], and have found their way into CNNs.
Mnih et al. [24] developed a recurrent neural network (RNN) that selectively processes a sequence of regions of an input image at high resolution.The network takes a region of the image (called a glimpse) and its location as input and determines the location of the next region to be processed.The computational expense is independent of image size, since only a few glimpses are taken and they contain many fewer pixels than the original image.The system outperforms CNN on image classification on the MNIST dataset.
Working to recognise house numbers in the SVHN dataset, Ba et al. [25] extended the idea to multiple object recognition by fixing the number of glimpses for each target in the object label sequence, and adding an end-of-sequence label.They added a contextual network that takes a low resolution version of the original image as input and provides initial state for the RNN.Their problem is formulated as classification rather than detection, but nonetheless their model is close in design to our attention proposal model for target detection.Their deep recurrent attention model (DRAM) is trained using reinforcement learning to find attention regions implicitly without the localization loss that we use (see Section 3.2).
Huang et al. [26] utilized the recurrent attention model to detect arbitrary oriented text in the wild and achieved state-of-the-art accuracy on ICDAR 2013 and MSRA-TD500.However, their detection pipeline depends on extremal regions to generate initial attention proposals and on CNN classifiers to filter non-text proposals.In contrast, we directly utilize an attention model to propose target regions.
Gidaris et al. [27] proposed the AttractioNet approach, which generates box proposals by iterative attention and refinement processes.Their approach surpasses all other box proposal methods.It differs from our method in that our attention model proposes (wide) context regions rather than (tight) object bounding boxes.In the case of multi-region detection, Gidaris et al. [28] improved results by making use of context information integrated from the immediate area surrounding each box; this points to the importance of context, a lesson echoed in our results.

Concepts
In this section, the framework of our system for small target detection and classification is presented.The system is composed of two parts: the attention proposal modeller (APM) and the accurate locator and recognizer (ALR).The two parts are designed to accomplish different tasks: the APM proposes attention regions that are likely to contain targets, telling ALR where to look; the ALR then localizes and classifies targets in these attention regions.Both tasks can be formulated as taking raw image pixels as input and performing regression on the coordinates of certain boxes.Since faster RCNN performs impressively at such a task, we adopt its structure as the basis for both parts.The difference is that the APM performs regression on the bounding box of an attention region while the ALR performs regression on the bounding box of a real object.Figure 2 provides an overview of the framework.We next discuss the design of the two parts.

Attention proposal modeller
The aim of the APM is not to precisely locate the targets, but to provide candidate regions with high confidence at low computational cost. This task depends more on global information about the whole image and less on details of the targets. Thus, the original high resolution image may be down-sampled to lower resolution for this purpose. We formulate the task as follows: given a high resolution image $I_H$, the APM takes the corresponding down-sampled image $I_L$ as input, and outputs a set of at most K attention regions. These attention regions are cropped from $I_H$ for use as input to the ALR, which accurately locates and classifies targets within them.
Our approach to producing attention regions is based on faster RCNN, which solves the following problem: given an input image, output a set of region proposals with their locations (x, y) and size (w, h). Each region proposed has an associated value that measures the confidence that the region contains an object of interest. Following faster RCNN, the APM comprises a region proposal network (RPN) and a fast RCNN. They share a convolutional sub-network that outputs a feature map of spatial size W × H. The RPN generates region proposals based on anchor boxes at each position of the feature map. High confidence proposals are then processed by fast RCNN through ROI pooling and fully connected layers. Both RPN and fast RCNN output box regression results and confidence scores; they are trained with ground truth boxes. In our case, we only have ground truth bounding boxes for traffic signals, but we can define the bounding boxes for attention regions. The attention region should enclose the traffic signal and its size should be proportional to the object size (for the reason stated below): see Fig. 3. Thus, the attention box $(x^*, y^*, w^*, h^*)$ is defined as follows, taking (x, y) to denote the box centre as in Ref. [1]:

$$x^* = x_0, \quad y^* = y_0, \quad w^* = h^* = \alpha \max(w_0, h_0),$$

where $(x_0, y_0, w_0, h_0)$ is the bounding box of a traffic light or sign. The scale α is set so that traffic lights or signs are contained within the proposal, but not so large as to slow a detailed search for the object at the next stage; we find α = 5 to be a suitable choice.

Now that we have the ground truth boxes of attention regions, we parametrize the coordinates of boxes as in Ref. [1]. For the RPN, the parametrized coordinates are calculated using:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a),$$
$$t^*_x = (x^* - x_a)/w_a, \quad t^*_y = (y^* - y_a)/h_a, \quad t^*_w = \log(w^*/w_a), \quad t^*_h = \log(h^*/h_a),$$

where $(t^*_x, t^*_y, t^*_w, t^*_h)$ are coordinates of the ground truth and $(t_x, t_y, t_w, t_h)$ are the RPN prediction, $(x_a, y_a, w_a, h_a)$ is the anchor box and (x, y, w, h) is the predicted box. Smooth $L_1$ loss [6] is used for regression in the RPN, which seeks to minimise the function:

$$L^{\mathrm{RPN}}_{\mathrm{loc}} = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(t_i - t^*_i).$$

Here, $L^{\mathrm{RPN}}_{\mathrm{loc}}$ is the regression loss of RPN. The classification loss is a cross-entropy loss of softmax output:

$$L^{\mathrm{RPN}}_{\mathrm{cls}} = -\frac{1}{N} \sum_{n=1}^{N} \log p_{n, l_n}.$$

N is the number of samples and $p_{n, l_n}$ is the predicted softmax probability of the nth sample belonging to the ground truth class $l_n$. There are two classes in the APM, one for the attention region and one for the background. The ground truth is based on the anchor box matching strategy proposed in Ref. [1].

For the fast RCNN subnetwork, the ground truth is also parametrized, but with respect to proposals from the RPN rather than anchor boxes:

$$u_x = (x_f - x_p)/w_p, \quad u_y = (y_f - y_p)/h_p, \quad u_w = \log(w_f/w_p), \quad u_h = \log(h_f/h_p),$$
$$u^*_x = (x^* - x_p)/w_p, \quad u^*_y = (y^* - y_p)/h_p, \quad u^*_w = \log(w^*/w_p), \quad u^*_h = \log(h^*/h_p),$$

where $u^*_i$ and $u_i$ ($i \in \{x, y, w, h\}$) are the parametrized coordinates of the ground truth and the fast RCNN prediction respectively. $(x_p, y_p, w_p, h_p)$ is the box proposed by RPN and $(x_f, y_f, w_f, h_f)$ is the predicted box from fast RCNN. Smooth $L_1$ loss is also used for regression in fast RCNN:

$$L_{\mathrm{loc}} = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L_1}(u_i - u^*_i).$$

The classification loss $L_{\mathrm{cls}}$ is also a 2-class softmax loss, and is defined as

$$L_{\mathrm{cls}} = -\frac{1}{N} \sum_{n=1}^{N} \log p_{n, l_n}.$$

To sum up, the overall loss function of the APM is a weighted sum of the four losses $L^{\mathrm{RPN}}_{\mathrm{cls}}$, $L^{\mathrm{RPN}}_{\mathrm{loc}}$, $L_{\mathrm{cls}}$, and $L_{\mathrm{loc}}$, combined using the weights $\lambda_1$, $\lambda_2$, $\lambda_3$. We set $\lambda_1$, $\lambda_2$, $\lambda_3$ to 1, as Ren et al. [1] found training is insensitive to their values over a wide range.

Fig. 3 The attention box is a square region enclosing the target bounding box, with side length α times that of the longer side of the bounding box.
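As a rough illustration of the attention box and box parametrization described above, the following Python sketch builds the square attention box from a target box, encodes it relative to a reference box (an anchor for the RPN, or an RPN proposal for fast RCNN), and evaluates a smooth L1 regression loss. It assumes boxes are given as (centre x, centre y, width, height), following the convention of Ref. [1], and that the attention box is centred on the target. The function names are illustrative and do not come from the paper.

```python
import math


def attention_box(x0, y0, w0, h0, alpha=5.0):
    """Square attention box around a target box (centre-size format).

    Assumption: the attention box shares the target's centre; its side is
    alpha times the longer side of the target box.
    """
    side = alpha * max(w0, h0)
    return (x0, y0, side, side)


def encode(box, ref):
    """Parametrize `box` relative to a reference box (anchor or proposal)."""
    x, y, w, h = box
    xr, yr, wr, hr = ref
    return ((x - xr) / wr, (y - yr) / hr, math.log(w / wr), math.log(h / hr))


def smooth_l1(d):
    """Smooth L1 penalty: quadratic for |d| < 1, linear otherwise."""
    d = abs(d)
    return 0.5 * d * d if d < 1.0 else d - 0.5


def loc_loss(pred, target):
    """Sum of smooth L1 penalties over the four parametrized coordinates."""
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


# A 40 x 60 target centred at (1000, 800) yields a 300 x 300 attention box.
gt = attention_box(1000, 800, 40, 60)
anchor = (1010, 790, 256, 256)      # hypothetical anchor box
t_star = encode(gt, anchor)         # regression target for this anchor
```

With α = 5, a target whose longer side is 60 pixels produces a 300 × 300 attention region, so the target always spans one fifth of the region's side.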
The number of proposals generated by the APM determines the computational cost of the second stage, so this number should not be too large while guaranteeing high recall. We apply non-maximum suppression (NMS) and filter the proposals with a confidence threshold T to reduce proposal number. The maximum number of proposals is set to K. If more than K proposals are generated, only the K highest scored proposals are considered by the second stage. We empirically choose K = 8. The architecture of the network is shown in Fig. 2 and Table 1. The shareable convolutional layers are similar to the Zeiler and Fergus model [29] and there are 3 fully connected layers after ROI pooling.
The RPN parameters such as anchor numbers, NMS threshold, and proposal numbers are set to the same values as in Ref. [1].It is worth noting that other detection algorithms could also be used as the attention proposal modeller, as long as they can generate attention regions of a similar kind and can produce a reasonably small set of results.With the definition of the attention box, we are able to train a faster RCNN to propose a small set of regions for further examination.This reduces the computational cost in two ways: the APM only needs a low resolution image input for attention proposals, and only a few regions need to be processed at high resolution in the second detection stage.
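The proposal filtering described above (confidence threshold T, NMS, and a cap of K proposals) can be sketched in a few lines of Python. This is a minimal greedy NMS, not the authors' implementation. The values of `conf_thresh` and `nms_thresh` are placeholders, since the paper does not state them; only K = 8 comes from the text. Boxes are assumed to be (x1, y1, x2, y2) corner coordinates.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def select_attention_regions(boxes, scores, conf_thresh=0.5, nms_thresh=0.3, k=8):
    """Return indices of at most k proposals kept after thresholding and NMS.

    conf_thresh and nms_thresh are placeholder values (assumptions).
    """
    # Keep confident proposals only, highest score first.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Greedy NMS: drop a proposal that overlaps a better one too much.
        if all(iou(boxes[i], boxes[j]) <= nms_thresh for j in kept):
            kept.append(i)
        if len(kept) == k:
            break
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = select_attention_regions(boxes, scores)
```

Because greedy NMS visits proposals in descending score order, thresholding before or after suppression keeps the same set of confident proposals, and stopping after K survivors is equivalent to taking the K highest scored ones.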

Accurate localization and recognition
The APM output is a set of regions, and only those regions need be examined for targets during the second stage. This brings two advantages. Firstly, it saves considerable computation, since only a small part of the original high resolution image, rather than the whole, is taken as input. Secondly, the proportion of object area to the attention region area is much larger than to the original image area, making the localization and recognition task easier. Detection algorithms such as fast RCNN and faster RCNN are usually poor at detecting small objects, but if attention regions are resized to reasonable scale, such algorithms would be suitable for detecting the originally small objects. The ALR takes attention regions proposed by the APM as input and scales them all to the same size, chosen based on performance on the validation set. Since the APM is supposed to predict regions whose sizes are α times as large as the target size, the target sizes in the scaled inputs lie within a small range, further simplifying the task. As we will see in the next section, the target sizes in the rescaled attention regions are concentrated in a narrow range, which helps achieve better performance for originally small targets. Many detection algorithms can be used as the second stage localizer and recognizer (ALR). We use the faster RCNN framework as it provides state-of-the-art results for many detection tasks. The architecture is the same as for the APM shown in Table 1 except that the number of class score outputs is adjusted to match the number of label classes. For the traffic sign dataset, the model is trained to recognize 45 classes of signs and to predict their bounding boxes, while for the traffic light dataset, lights need to be classified into 6 categories, and the light housing bounding boxes need to be regressed. For all other settings of the framework, we just follow faster RCNN [1].
At testing time, all proposed regions are fed forward in a batch and the output bounding boxes are transformed to their original position in $I_H$. Then NMS is applied to yield the final localization results.
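The transformation of output boxes back to the high resolution image can be illustrated as follows. This Python sketch assumes each attention region is described by its top-left corner and size in the high resolution image, and that it was resized to a square ALR input (360 × 360 in the experiments); the function name and box layout are illustrative assumptions.

```python
def to_image_coords(det, region, input_size=360):
    """Map a detection from the resized attention region back to the image.

    det:    (x1, y1, x2, y2) in the input_size x input_size ALR input.
    region: (rx, ry, rw, rh), top-left corner and size of the attention
            region in the high resolution image (assumed layout).
    """
    rx, ry, rw, rh = region
    # Undo the resize, then undo the crop offset.
    sx, sy = rw / input_size, rh / input_size
    x1, y1, x2, y2 = det
    return (rx + x1 * sx, ry + y1 * sy, rx + x2 * sx, ry + y2 * sy)


# A 72 x 72 detection in a 360 x 360 input, from a 180 x 180 region at (900, 700).
box = to_image_coords((144, 144, 216, 216), (900, 700, 180, 180))
```

After this mapping, detections from different attention regions share one coordinate frame, which is what allows a single NMS pass to merge duplicates from overlapping regions.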

Experiment
We performed experiments on detection and classification of two kinds of small targets in street views: traffic signs (see Section 4.1) and traffic lights (see Section 4.2).The experiments on traffic signs used the Tsinghua-Tencent 100K dataset [5], and we compare our method to the method in that paper.The traffic light detection and classification experiments used our TTTL dataset, as well as the LISA traffic light dataset to test generality.

Traffic sign detection and classification
To make a fair comparison with the method in Ref. [5], we used their training and testing data. There are 45 classes of traffic signs; each class has more than 100 instances. We did not follow their data augmentation protocol, in which they blend traffic sign templates with background street views to generate more data. To diminish the imbalance in number of samples between classes, we oversampled classes with fewer than 1000 instances to ensure that each class had over 1000 samples in each epoch. No other data augmentation was conducted. For attention model training, we set the enlargement ratio α of target bounding boxes to 5. The attention region boxes were not class specific, i.e., there were only two classes, attention region and background, in the attention model.
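One simple way to realise such oversampling is to repeat every sample of a rare class an integer number of times per epoch. The Python sketch below computes those repetition factors. It is an assumption about how the balancing could be done, not a description of the authors' procedure, and the class names and counts in the example are made up.

```python
import math


def oversampling_factors(instance_counts, min_per_class=1000):
    """Number of times each class's samples are repeated in one epoch.

    Classes already at or above min_per_class are left unchanged.
    """
    return {
        name: max(1, math.ceil(min_per_class / count))
        for name, count in instance_counts.items()
    }


# Hypothetical per-class instance counts.
counts = {"class_a": 1200, "class_b": 150, "class_c": 400}
factors = oversampling_factors(counts)
per_epoch = {name: counts[name] * factors[name] for name in counts}
```

Here the rare classes end up with 1050 and 1200 samples per epoch, so every class reaches at least 1000 samples.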

Training
When training the APM, we resized the original 2048 × 2048 high resolution images to 480 × 480 lower resolution images, and trained the network with a single image per batch for 100,000 iterations, corresponding to approximately 15 epochs over the training data.
For the ALR, we trained the network on the attention regions generated by the attention model. There were about 47,000 images per epoch and the network was trained for 500,000 iterations with batch size 1. The input images were resized to 360 × 360.
For both APM and ALR, we used SGD with initial learning rate $10^{-3}$ and momentum 0.9. The learning rate was set to $5 \times 10^{-4}$ after 300,000 iterations for ALR. We set the dropout ratio to 0.5 for the fc6 and fc7 layers. Both networks were trained from scratch, after initialization using the method of He et al. [30]. When testing the system, the input size of both networks was the same as for training, and the maximum number of attention proposals K was set to 8.

Evaluation
We evaluated our method on the Tsinghua-Tencent 100K test dataset. It achieved 87.0% mAP at a Jaccard similarity coefficient of 0.5, and the average recall and precision at highest F1 score were 83.4% and 91.7% respectively. The performance is close to that of the state-of-the-art method due to Zhu et al. [5], which has an mAP of 87.5%, an average recall of 86.0%, and an accuracy of 88.3%. However, our method is an order of magnitude faster than their Overfeat-based method, as we avoid scanning the whole high resolution image and detecting at multiple scales; they process input images at scales 0.5, 1, 2, and 4. The original images are of size 2048 × 2048, so Zhu et al.'s largest input image has size 8192 × 8192, which incurs a tremendous computational cost. Our approach takes only 0.3 s to process the same image.
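For readers unfamiliar with these measures, the following Python sketch shows how precision, recall, and F1 can be computed for one image when a detection counts as correct if its class matches and its Jaccard similarity (intersection over union) with an unmatched ground truth box is at least 0.5. It is a simplified illustration under those assumptions and not the benchmark's official evaluation code.

```python
def jaccard(a, b):
    """Jaccard similarity (IoU) of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(detections, ground_truth, thresh=0.5):
    """detections, ground_truth: lists of (box, label).

    Detections are assumed to be sorted by descending confidence; each
    ground truth box can be matched at most once.
    """
    matched, tp = set(), 0
    for box, label in detections:
        best, best_j = None, thresh
        for j, (gbox, glabel) in enumerate(ground_truth):
            if j in matched or glabel != label:
                continue
            v = jaccard(box, gbox)
            if v >= best_j:
                best, best_j = j, v
        if best is not None:
            matched.add(best)
            tp += 1
    precision = tp / len(detections) if detections else 0.0
    recall = tp / len(ground_truth) if ground_truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gt = [((0, 0, 10, 10), "a"), ((20, 20, 30, 30), "b")]
dets = [((0, 0, 10, 10), "a"), ((50, 50, 60, 60), "b")]
result = precision_recall_f1(dets, gt)
```

Sweeping the detector's confidence threshold and recomputing these quantities traces out a recall-precision curve; the operating point with the highest F1 is the one quoted in the text.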
We also evaluated faster RCNN [1] on the dataset as a baseline method.We used ALR alone to detect and classify targets on the original high resolution image.Both its performance and efficiency are lower than those of our method.Table 2 gives a detailed comparison of the three methods.All methods were benchmarked on an nVidia GTX980 GPU.
To demonstrate that the attention model can propose regions at a suitable scale, we examined the statistics of the target sizes in attention regions resized to 360 × 360, and in the original images. All targets in the TT100K testing set were considered. We used the square root of the area of the target bounding boxes as a measure of target sizes. As shown in Fig. 4(a), in attention regions, more than 80% of the targets were in the size range [32, 96] pixels, while the original size of over 40% of the targets was smaller than 32 pixels. In other words, the targets originally had widely differing sizes, but in attention regions they were concentrated at medium sizes. This was to be expected since the APM was trained to propose attention region boxes that are α times as large as the target bounding box.
Our APM inherently solves the problem of adjusting scale, making it easier for the ALR to accurately locate and classify targets. Therefore, our method performs just as well on small (area < 32²), medium (32² < area < 96²), and large (area > 96²) targets, as shown in Figs. 4(b)-4(d). In contrast with faster RCNN, which has poor performance on small targets, our method and the Overfeat-based method of Ref. [5] both have high recall and precision on small targets. Our method can furthermore detect small targets that Overfeat fails to detect, as shown in Fig. 6.
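The size measure and the scale normalization can be made concrete with a short Python sketch. It bins a target by the square root of its box area, and computes the size the target would have in a 360 × 360 input if the APM proposed an ideal attention box of side α times the target's longer side. The treatment of the exact bin boundaries and the idealized proposal are assumptions made for illustration.

```python
import math


def size_bin(w, h):
    """Bin a target by sqrt(area): small < 32, medium in [32, 96], large > 96."""
    s = math.sqrt(w * h)
    if s < 32:
        return "small"
    if s <= 96:
        return "medium"
    return "large"


def size_in_attention_region(w, h, alpha=5.0, input_size=360):
    """Target size after an ideal attention box is resized to the ALR input."""
    scale = input_size / (alpha * max(w, h))
    return math.sqrt(w * h) * scale


small_before = size_bin(20, 20)                 # a 20 x 20 target is "small"
small_after = size_in_attention_region(20, 20)  # about 72 pixels after rescaling
```

For a square target the rescaled size is 360 / α = 72 pixels regardless of the original size, which lies inside [32, 96]. Elongated targets and imperfect proposals spread the sizes around this value.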
Our method may miss targets in unusual contexts since the APM is unlikely to propose such regions for further detection. Such failures are shown in Fig. 6.

Traffic light detection and classification
We also tested our method on traffic light detection and classification.Although many methods have been proposed for these tasks, there is no readily available specific dataset with high resolution street view images.We thus built a dataset specifically for traffic light detection and classification, in order to evaluate our method and to provide a benchmark for other studies.We also used the LISA traffic light database [8] to test the generalization capability of our model: the training set, TTTL, and the LISA test set originate in different countries and therefore exhibit differences in traffic lights.
We chose to build this dataset using Tencent street view data, as street views are closer to driving scenarios than photos taken by pedestrians with cameras or cell phones.Furthermore, there is sufficient Tencent street view data to cover diverse scenes and lighting conditions, providing a good test of the method robustness.While the LISA dataset contains continuous frames from video sequences, we picked street views from different places to ensure diversity.The dataset consists of more than 16,000 images; about 8300 of them contain traffic lights.We call it the Tsinghua-Tencent traffic light (TTTL) dataset.We trained our networks on a training set of over 6700 images and evaluated them on the testing set of about 1600 images.We also tested the trained model on 6 clips from the LISA dataset.All experiments used an nVidia GTX 980 GPU.

Tsinghua-Tencent traffic light dataset
The Tencent street view images were captured by vehicle or shoulder mounted cameras, and postprocessed to form 8192 × 4096 pixel high resolution panoramas. Since the upper and lower parts of the panoramas are mainly sky and ground, the images are cropped to between 25% and 62.5% of their height, and then split into 4 pieces horizontally. This yielded 16,313 images of size 2048 × 1536 pixels. Those images were annotated with bounding boxes of the traffic light surrounds, bounding boxes of the lit bulbs, and the kinds of lights. We have 15 classes of lights, including an other class and an unrecognisable class. Some classes have very few instances, so we just considered 6 major classes: green, red, red left turn, green forward, red pedestrian, and other. Ignoring those images without traffic lights, we randomly split the dataset into training and testing sets in the ratio of 4:1, yielding 6709 training images and 1656 testing images. Table 3 gives the number of instances and example images for each of the 6 classes.
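The cropping arithmetic can be checked with a small Python sketch that computes the four tile rectangles for one panorama. The function name and the (x, y, width, height) layout are illustrative; only the panorama size, the crop fractions, and the number of pieces come from the text.

```python
def panorama_tiles(width=8192, height=4096, top=0.25, bottom=0.625, n_tiles=4):
    """Return (x, y, w, h) rectangles of the tiles cut from one panorama."""
    y0, y1 = int(height * top), int(height * bottom)  # keep rows 1024..2560
    tile_w = width // n_tiles
    return [(i * tile_w, y0, tile_w, y1 - y0) for i in range(n_tiles)]


tiles = panorama_tiles()
```

Each tile is 2048 pixels wide (8192 / 4) and 1536 pixels high (37.5% of 4096), matching the image size stated above.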

Training
As for the traffic sign task, we resized the original 2048 × 1536 images to 480 × 360 lower resolution images as inputs to the APM.The network was trained for 75,000 iterations, and about 11 epochs, with batch size 1.The trained APM was used to generate about 13,000 attention region images over the training data.The ALR was then trained with those images for 500,000 iterations with a single image per batch.The attention region images were also resized to 360 × 360 as input.As in the traffic sign task, SGD was used and the learning rate scheduler was the same; the dropout ratio of fc6 and fc7 was 0.5.Both networks were initialized with the method proposed by He et al. [30].When testing the system, the input size of both networks was the same as for training, and the maximum number of attention proposals K was set to 8.

Evaluation
We tested our method on the TTTL testing set, achieving an mAP of 86.2%, an average recall of 83.6%, and an average precision of 84.7%, without considering the other class. As shown in Table 4, our method performs better and runs faster than the baseline faster RCNN. The performance on targets of different sizes is shown in Fig. 5. Similarly to the results found for traffic signs, although the original sizes of over 33% of targets are smaller than 32 pixels, in the resized attention regions proposed by the APM nearly 90% of them are concentrated in the size range [32, 96]. The recall and precision for small, medium, and large targets are close to each other. In comparison to the baseline faster RCNN, the performance for medium and large targets is similar, but our method has much higher recall and precision for small targets, because those small targets occupy a relatively larger proportion of the attention regions. Figure 7 shows some examples of our results in various challenging cases. Our method is robust to variations in lighting conditions such as overexposure and underexposure, and to different environmental contexts such as underneath bridges and the entrances of tunnels.

Generalization
To evaluate the generalization capability of our method trained on the TTTL dataset, we tested it on clips from the LISA dataset; the results are given in Table 5. Our model has high overall recall and precision on the go and stop classes in the LISA dataset, even though it was not trained on any data in LISA. This demonstrates that our model has good generalization capability. While the data from TTTL are all street views in China, the videos in LISA were all captured on US roads and have different lighting and weather conditions. These results show the robustness of our model with respect to varying scenes and natural conditions. We note that for DayClip5, the precision of stop is low. This may be because our model classifies stop left lights as stop lights. In video sequences, there are many very similar frames, so any mistakes made by the method are repeated many times in DayClip5. Similarly, in DayClip8 the model's confusion of go with go left is repeated. Also, there are very small traffic lights that are not annotated, but are detected by our model, explaining the relatively low precision. We expect that the classification performance could be improved by fine tuning our model on the LISA dataset.

Figure caption (partial; recall-precision panels of Fig. 4 or Fig. 5): the panels show the recall-precision curves of our method and faster RCNN for different target sizes; our method performs much better than faster RCNN on small targets.
Fig. 6 (caption, partial) In the first two cases, our method detected small targets that Overfeat missed. In the last two cases, our method missed some targets, as the APM failed to propose those regions.

Conclusions
In this paper, we have presented an attention model based detection framework to tackle the problem of detecting small objects in large high resolution images. We applied our method to traffic signal detection in street view images. As a complement to the TT100K benchmark, we have built the Tsinghua-Tencent traffic light dataset for training and testing.
Our framework outperforms the baseline faster RCNN on both datasets, especially when detecting small targets with an area of less than 32 × 32 pixels. Furthermore, our system runs an order of magnitude faster than the state-of-the-art method on TT100K, while having similar recall and precision. Experiments show that the attention proposal model can generate a small set of candidate regions whose area as a proportion of target size lies in a narrow range, making the second stage localization and classification more accurate. Our model trained on the TTTL dataset also shows good generalization capability, achieving high recall and precision on the LISA dataset without any training on it.
In future, we hope to improve the recall of our framework by exploring better attention proposal methods. As our framework is intended for detection in still images, we would like to develop a method for video sequences that utilizes previous detection results to further reduce computational cost. We are also planning to apply our method to other problems such as detecting small targets in remote sensing images. There, the ratio between target size and image size can be even smaller, so it is more challenging to accurately locate targets with relatively low computational cost. Generalization is also interesting: all of the datasets we used are from countries that drive on the right, and there is no database we know of from countries that drive on the left.

Fig. 1
Fig. 1 Examples of traffic signs and traffic lights in street views. Traffic signs and traffic lights are detected separately: (a) shows traffic signs only while (b) shows traffic lights only. Green rectangles on the left of each subfigure indicate attention regions; rectangles on the right are corresponding cropped attention regions and bounding boxes of targets within them.

Fig. 2
Fig. 2 System overview. Our attention proposal model has a similar architecture to faster RCNN but outputs bounding boxes of attention regions. The accurate locator and recognizer also uses a faster-RCNN-like model. It takes as input cropped and resized attention regions generated by the attention proposal model, and predicts bounding boxes and classes of the targets in attention regions.
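Because the second stage predicts boxes inside a cropped and resized attention region, its outputs have to be mapped back to the coordinates of the original image. A minimal sketch of that mapping follows; it assumes axis-aligned boxes given as (x1, y1, x2, y2) and a square resized crop, and the function name, the 600-pixel crop size, and the example coordinates are hypothetical rather than taken from the paper.

```python
def to_image_coords(box, region, resized_side):
    """Map a box from resized-crop coordinates back to image coordinates.

    box:          (x1, y1, x2, y2) predicted inside the resized crop
    region:       (x1, y1, x2, y2) of the attention region in the image
    resized_side: side length in pixels of the square resized crop
    """
    rx1, ry1, rx2, ry2 = region
    region_w = rx2 - rx1
    region_h = ry2 - ry1
    bx1, by1, bx2, by2 = box
    # Undo the resize (scale by region size / crop size), then undo the crop
    # (shift by the region's top-left corner).
    return (
        rx1 + bx1 * region_w / resized_side,
        ry1 + by1 * region_h / resized_side,
        rx1 + bx2 * region_w / resized_side,
        ry1 + by2 * region_h / resized_side,
    )


# Hypothetical example: a 200 x 200 attention region resized to 600 x 600.
region = (1000, 500, 1200, 700)
box_in_crop = (150, 300, 210, 360)
box_in_image = to_image_coords(box_in_crop, region, resized_side=600)
print(box_in_image)
```

Here a 60-pixel box in the resized crop corresponds to a 20-pixel box in the original image, which shows why the second stage sees small targets at a more favorable scale.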

Fig. 4
Fig. 4 The attention model improves small target detection performance for different sizes in the Tsinghua-Tencent 100K testing set. (a) The attention proposal model tends to propose attention regions at reasonable scales so that target sizes in the resized regions are concentrated in the range [32, 96] while in the original images they are more widely distributed. (b)-(e) Recall-precision curves for three methods on targets of different sizes. Our method outperforms faster RCNN and is competitive with the method of Zhu et al.

Fig. 5
Fig. 5 The attention model improves small target detection performance on the Tsinghua-Tencent traffic light testing set. (a) As for traffic sign results, target sizes in the resized attention regions are more concentrated around medium sizes than those in the original images. (b)-(e) show the recall-precision curves of our method and faster RCNN for different target sizes; our method performs much better than faster RCNN on small targets.

Fig. 6
Fig. 6 Example results for challenging cases in the Tsinghua-Tencent 100K dataset. Above: our results. Below: Overfeat results. The bottom right of each image shows a close-up of the region of interest. In the first two cases, our method detected small targets that Overfeat missed. In the last two cases, our method missed some targets, as the APM failed to propose those regions.

Fig. 7
Fig. 7 Results for some challenging cases in the Tsinghua-Tencent traffic light dataset. Above: images. Below: close-ups. Our method is robust to different lighting conditions, e.g., extremely dark regions under a bridge and bright regions under strong sunlight.

Table 1
Network structure of the attention proposal model

Table 2
Performance of three methods on the TT100K dataset

Table 3
Number of instances for each major class in Tsinghua-Tencent traffic light dataset

Table 4
Performance of our method and faster RCNN on the Tsinghua-Tencent traffic light dataset

Table 5
Performance of our TTTL-trained model on the LISA traffic light dataset