Real-Time Detection and Recognition of Railway Traffic Signals Using Deep Learning

Automated detection and recognition of traffic signals are of great significance in railway systems. Autonomous driving solutions are well established for urban rail transportation systems. Many metro lines in service worldwide have reached the highest grade of automation, where the train is operated automatically without any staff on board. However, autonomous driving is still an open challenge for mainline trains, due to the complexity of the mainline environment. In this context, automated recognition of wayside signals can help to minimise the risk of human error owing to low visibility and fatigue. It represents a key step towards the fully autonomous train. In this article we present a deep learning-based approach for this task. The You Only Look Once version 5 (YOLOv5) network is used for detection and recognition of wayside signals. A heuristic is used to recognise blinking states. We consider the FRSign dataset, a large collection of over 100,000 images of traffic signals from some of the trains of the French Railways. A distilled and cleaned version of the dataset, curated by us, is used for training. The trained network has low computational overhead and can recognise traffic signals in real time and under diverse field conditions. It performs robustly even for complex scenes with multiple signals and light sources, and in adverse circumstances such as rain and night environments. The refined version of the dataset is published openly for validation and further research and development.


Introduction
Human errors in perception and judgment of traffic signals are critical in railway systems. Causes of such errors may be poor visibility conditions, operator fatigue, and lack of attention. Automated real-time systems for detecting and recognising track side traffic signals of railways supplement human attention and pave the way for fully autonomous operation.
According to the international norm "IEC 62267:2009 Railway applications-Automated urban guided transport (AUGT)-Safety requirements" (IEC 2009), four different Grade(s) of Automation (GoA) can be distinguished in railways, depending on the level of responsibility assigned to the driver (Fig. 1).
Autonomous driving solutions for urban systems (such as metro lines) have been deployed and widely adopted for decades now, up to GoA4, with the milestone of 1000 km of automated metros reached in March 2018 with the opening of the Pujiang Line in Shanghai (Simoni 2021). According to a 2018 report by the International Association of Public Transport (UITP), nearly a quarter of the world's metro systems have at least one fully automated line in operation (UITP 2019). Conversely, the automation of mainline railway transport is still at an early stage, with service applications mainly limited to GoA1, and only a few implementations of GoA2 (Emery 2017; Alstom 2019, 2020; Siemens Mobility 2021; Thameslink Programme 2021).
Researchers within the Shift2Rail initiative have recently worked on specifications of ATO (Automatic Train Operation) over mainline signalling technologies to support transition to GoA3/GoA4 (Simoni 2021). Alstom has announced field trials of autonomous regional trains to begin in 2023 (Alstom 2020).
There are several reasons why higher grades of automation are easier to achieve in urban lines than in mainline railways. For instance, metros are mostly operated in a structured, closed environment, without level crossings or rail crossing junctions between stations. In the mainline case, the open infrastructure is shared by multiple operators, for both passenger and freight traffic. It is subject to highly variable and unpredictable perturbations due to weather events and to interactions with humans, animals, vegetation and landscape. This makes the obstacle detection task much more challenging. Also, while the rolling stock type in a metro system is homogeneous, mainline infrastructures are generally designed to guarantee interoperability of several types of trains. The presence of platform screen doors in metro stations provides an additional level of safety for GoA4 implementation. Finally, urban systems are typically operated at relatively low speeds on a line of limited length, while mainline trains may run at very high speeds and over a track layout spanning several thousands of kilometers. Hence, mainline railway systems can only afford small reaction times.
The greater complexity of mainline systems makes the task of automated detection and recognition of traffic signals significantly more arduous. In comparison with road signals, for example, railway mainline signals have more complex shapes and configurations and a relatively large number of possible states. These include multiple combinations of lights, some with blinking states, representing specific traffic control instructions. Light sources from other approaching trains and track side structures compound the problem. This makes the task of recognizing and interpreting light indications more difficult than the road signal detection attempted in various self-driving cars. Further, safety requirements are much more stringent in railways than in road vehicles (Rangra et al. 2018).
In this paper, we attempt to develop an affordable fast-response system that can detect and recognise railway signals in realistic and challenging environments. The You Only Look Once (YOLO) version 5 (Jocher 2020) deep learning architecture has been used to reduce the time required for automated signal recognition while maintaining adequate accuracy. Note that the recognition time together with the driver response time constitutes the real-time window in the GoA1 and GoA2 grades of automation. To train the system, we leverage a large railway signalling dataset collected from trains in field operation (Harb et al. 2020), with video sequences acquired from cameras filming the railway from the train driver's point of view. The original dataset is large and needs some pre-processing to make it more suitable for machine learning applications. It has been distilled into a smaller dataset by removing duplicates and cleaning erroneous annotations. Also, the dataset contains various types of signals with multiple combinations of lights/signal states, including blinking patterns. Despite being a key feature for practical use of vision-based driving assistance tools, automated detection and recognition of blinking lights has not received much attention in the literature. A computationally efficient method for recognition of states associated with blinking lights is proposed in the present study.
The proposed system could be integrated into a module for automatic generation and real-time update of maps of wayside signalling equipment. Such a module would represent the key feature of an advanced railway driver assistance system that pre-advises and swiftly warns the driver about the status of signals on the tracks ahead. A beneficial impact on the safety of the system, for instance in case of driver inattention, could be attained thanks to the proposed computer vision-based algorithm. Further, the development of an automated decision support system including the approach described in this paper would facilitate the transition towards increasingly higher degrees of automation for mainline railways.
The paper is organised as follows: the next section outlines the related works in the literature. The methodology followed to address the fast detection and recognition of traffic signals is discussed in Section "Methodology". In Section "Distilled French Railway Signal Dataset", we describe the dataset and how it has been distilled in order to train a high performance deep learning system. The fifth section presents the results and the performance of the system, with a focus on illustrative scenarios that highlight the robustness of the proposed method. The conclusions of the study are provided in the last section.

Related Work
Automated signal detection and recognition has been the focus of study for road traffic in the context of self-driving cars (Yuan et al. 2019). For instance, recognition of road signals in poor environmental conditions has been studied by Temel et al. (2020). A comparative analysis of several state-of-the-art object detection and tracking algorithms to detect and track different classes of road vehicles is presented in Mandal and Adu-Gyamfi (2020). Further, many open datasets are available for researchers working on autonomous vehicles (Harb et al. 2020). However, only a scant amount of literature has been reported on the application of these methods to railway transportation. Early attempts were often based on traditional image processing techniques and on handcrafted features used in computer vision for object detection (Melander and Halme 2016). For instance, in Marmo and Lombardi (2006) the authors use classical methods such as histogram analysis, template matching and shape feature extraction to detect a specific type of signal and to classify it according to the colour of the light (green: GO, red: STOP). However, the detection is focused on a single element with a binary state, and the examples presented are scenes of low complexity. A neural network for detection and recognition of railway signs is discussed in Mikrut et al. (2015). The study considers only a very specific class of static railway warning signs (without lights), and no traffic signals have been taken into account.
With the advent of deep learning, some studies have been carried out for railway traffic signal detection and classification. According to the review study in Varghese et al. (2020) it has been found that for transport-oriented applications, deep learning models typically show better prediction accuracies compared to conventional machine learning models.
In Karagiannis et al. (2020), a Faster-RCNN-based object detection methodology is applied to detect railway signals and signs. The study mainly looked at static sign elements. Further, the results were compromised by the small size of the objects, the low resolution of the images and the high similarity across classes. A hybrid architecture combining Faster-RCNN with hierarchical feature networks has also been recently suggested by Choodowicz et al. (2020). The proposed method, though, relies on traditional image processing techniques for recognition of light signals. Feature fusion based approaches for railway obstacle detection provide state-of-the-art performance for this task (Ye et al. 2021). Similarly, on-board embedded computer vision systems for recognition of railway objects have been proposed by Etxeberria-Garcia et al. (2020). A comprehensive review of applications of deep learning techniques to intelligent transportation systems, including some case studies from railways, is provided in Haghighat et al. (2020). However, the railway studies reported by the authors focus on obstacle detection and detection of tracks.
To the best of the authors' knowledge, no large-scale study on detection and recognition of railway signals in field conditions with the goal of real-time operation has been reported in the literature so far. It must be noted that no previous work attempts detection and recognition of different classes of railway traffic signals with multiple states, including states associated with blinking lights. A detailed study on the performance of such systems is necessary for field implementation. For this purpose, we present a deep learning based system trained on a large collection of real railway traffic images collected on mainline trains in operation on the French Railway system (Harb et al. 2020).

Methodology
We follow a two-step procedure for detecting the static and the blinking states, respectively. First, a YOLOv5 deep learning architecture is trained using image frames from a railway signal dataset captured from specific trains of the French Railway. The trained network is then used to detect traffic signals in individual frames. The frames containing traffic signals are passed through the classification modules of the deep learning architecture to recognise the signal states. A bounding box around the signal is also generated. Since some of the signals involve blinking lights, we need to further post-process this information. For this, in the second step, we consider a sequence of m bounding boxes in successive frames. A simple heuristic is used to detect signal blinks in these sequences. If no blink is detected, the original signal states are inferred. If a blink is detected, special signal states involving blinks are predicted. The scheme is illustrated in Fig. 2.
For the detection and recognition step, we use the You Only Look Once version 5 (YOLOv5) deep network. It produces state-of-the-art recognition in various applications. It is also very fast in producing detection and recognition outputs, which makes it suitable for real-time applications. We describe the YOLOv5 network in detail in the next section, followed by the heuristic for blinking lights detection. YOLO (Redmon et al. 2016) is a state-of-the-art, real-time object detector, and YOLOv5 (Jocher 2020) is based on YOLOv1-YOLOv4 (Du 2018; Bochkovskiy et al. 2020). This is the latest development of the You Only Look Once architecture series. The detection accuracy of this network model is high, and the inference speed is fast, with the fastest detection speed being upwards of 140 frames per second. In addition, the weight file of the YOLOv5 target detection network model is small, nearly 90% smaller than that of YOLOv4, indicating that the YOLOv5 model is suitable for deployment in embedded devices to implement real-time detection. In summary, the advantages of the YOLOv5 network are its high detection accuracy, lightweight characteristics, and fast detection speed (Fang et al. 2021).

YOLOv5 Deep Network
The YOLOv5 framework mainly consists of three components: the backbone network, the neck network and the detect network. These are shown in Fig. 3. The backbone network is a convolutional neural network that aggregates different fine-grained images and forms image features. It consists of four modules, namely, the focus module used to reduce computation time, the concat module used to connect various layers through a Leaky ReLU activation layer, the bottleneck cross-stage-partial (CSP) module consisting of convolution and residual network layers, and the spatial pyramid pooling module.
The neck network is composed of a series of feature aggregation layers that mix and combine image features, and it is mainly used to generate feature pyramid networks (FPN).
The feature map computed by the neck network is then transmitted to the detect network (prediction network). This network is mainly used for the final detection part of the model, which applies anchor boxes on the feature map output from the previous layer. The final output is represented by a vector with the category probability of the target object. For the application presented in this study, the overall network has a dual goal: (i) to classify the railway signals and (ii) to localize them using bounding boxes. For the purpose of classification, the binary cross entropy with logits loss is used as the loss function for training. Binary cross entropy H_p(q) is expressed as the negative average of the log of corrected predicted probabilities, as defined in Eq. 1.
$$H_p(q) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log\big(p(y_i)\big) + (1 - y_i) \log\big(1 - p(y_i)\big) \right] \quad (1)$$
where N is the total number of examples, y_i is the true class label, and p(y_i) is the probability of y_i being the predicted class label. The probability is calculated by applying the sigmoid function to the logit output of the network.
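As an illustration of the classification loss described above, the following is a minimal sketch of binary cross entropy computed directly from logits. It is not the YOLOv5 implementation; the function names `sigmoid` and `bce_with_logits` are hypothetical, and the numerically stable rearrangement used here is a common choice that we assume rather than take from the source.

```python
import math


def sigmoid(z: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def bce_with_logits(logits, targets):
    """Mean binary cross entropy computed from raw logits.

    logits  : raw network outputs, one per example
    targets : true labels y_i, each 0 or 1
    (Hypothetical helper for illustration only.)
    """
    if len(logits) != len(targets) or not logits:
        raise ValueError("logits and targets must be non-empty and equal in length")
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(s(z)) + (1 - y)*log(1 - s(z))], s = sigmoid
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


if __name__ == "__main__":
    # Two examples: a confident correct positive and a mildly correct negative
    print(bce_with_logits([2.0, -1.0], [1, 0]))
```

The stable form avoids evaluating log(0) when a logit is very large in magnitude, while giving the same value as the direct formula.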
Bounding box regression is used for the purpose of signal localization. The loss metric used for bounding box regression is based on the popular intersection over union (IOU) measure computed between the true bounding box and the predicted bounding box. Three geometric factors in bounding box regression, i.e. overlap area, central point distance and aspect ratio, have been used to define the Complete IoU (CIoU) loss in Zheng et al. (2020). This is used as the loss function during training of YOLOv5. A batch size of 32 images is used during the training phase.
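The three geometric factors mentioned above can be made concrete with a small sketch of the IoU measure and the CIoU loss, following the formulation in Zheng et al. (2020). This is an illustrative sketch rather than the YOLOv5 training code; the function names and the (x1, y1, x2, y2) box convention are assumptions, and boxes are assumed to have positive width and height.

```python
import math


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred, true):
    """CIoU loss: 1 - IoU + centre-distance term + aspect-ratio term."""
    i = iou(pred, true)
    # Squared distance between the two box centres
    rho2 = ((pred[0] + pred[2]) / 2 - (true[0] + true[2]) / 2) ** 2 \
         + ((pred[1] + pred[3]) / 2 - (true[1] + true[3]) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes
    cw = max(pred[2], true[2]) - min(pred[0], true[0])
    ch = max(pred[3], true[3]) - min(pred[1], true[1])
    c2 = cw ** 2 + ch ** 2
    # Aspect-ratio consistency term v and its trade-off weight alpha
    wp, hp = pred[2] - pred[0], pred[3] - pred[1]
    wt, ht = true[2] - true[0], true[3] - true[1]
    v = (4 / math.pi ** 2) * (math.atan(wt / ht) - math.atan(wp / hp)) ** 2
    alpha = v / ((1 - i) + v) if v > 0 else 0.0
    return 1 - i + rho2 / c2 + alpha * v
```

Unlike plain IoU, the loss remains informative for non-overlapping boxes, because the centre-distance term still changes as the predicted box moves towards the true box.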
We use a pre-trained YOLOv5 network (Jocher 2020). It is pre-trained on the MS COCO dataset. COCO stands for Common Objects in Context. It is a large-scale object detection, segmentation, and captioning dataset benchmark (Lin et al. 2014). It has been observed in previous studies that dropout and batch normalization do not reduce overfitting in YOLOv5. However, the hardswish activation and the multiple loss functions used in YOLOv5 are expected to reduce overfitting.

Blinking Signal State Detection
In the general object detection framework, we provide as input one single image at a time, and the corresponding "static" label of the object is obtained. In addition, to detect the objects associated with blinking states such as 'feu rouge clignotant' (i.e. red blinking light) and 'feu jaune clignotant' (i.e. yellow blinking light), we have introduced the following heuristic.
Let us consider the current timestamp t = k. We have the information from the last m frames (t = k − m to t = k). We do not need the complete images, but only the bounding box coordinates and the label present in each of the previous m frames. This is illustrated in Fig. 4. The steps for detecting and recognizing the blinking signals are as follows:
1. Let o be the count of 'off' frames (those without any detected bounding box) in the last m frames of a particular object.
2. A frame sequence for which the likelihood ratio r = o/m lies in the range 0.3 ≤ r ≤ 0.7 is considered to be a 'blinking signal'.
3. We update the class label if and only if the output from the blinking detection heuristic of the previous step is 'blinking signal'. Otherwise, we do not change the label that was initially provided by the object detection model.
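The three steps above can be sketched in a few lines. The 0.3 and 0.7 bounds come from the text; everything else is an assumption made for illustration, in particular the function names, the boolean per-frame detection history, and the static label names 'feu rouge' and 'feu jaune' used as keys in the mapping.

```python
# Hypothetical mapping from a static label to its blinking counterpart.
# The blinking class names appear in the text; the static keys are assumed.
BLINKING_LABEL = {
    "feu rouge": "feu rouge clignotant",
    "feu jaune": "feu jaune clignotant",
}


def is_blinking(detected, low=0.3, high=0.7):
    """Return True if the ratio of 'off' frames lies in [low, high].

    detected : list of booleans for the last m frames of one tracked signal,
               True when a bounding box was detected in that frame.
    """
    m = len(detected)
    if m == 0:
        return False
    o = sum(1 for d in detected if not d)  # count of 'off' frames
    r = o / m                              # ratio r = o / m
    return low <= r <= high


def update_label(static_label, detected):
    """Replace the static label only when the heuristic reports a blink."""
    if is_blinking(detected):
        return BLINKING_LABEL.get(static_label, static_label)
    return static_label


if __name__ == "__main__":
    history = [True, False] * 5  # signal seen in every other frame
    print(update_label("feu rouge", history))
```

A signal that is visible in every frame (r = 0) or missing in every frame (r = 1) falls outside the range and keeps its original label, which matches step 3.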
Multiple signal bounding boxes may be present in some of the sequences involving blinking signals. In the above method, it is required to establish correspondence between identical signal bounding boxes in consecutive frames. The steps for correspondence identification of bounding boxes are as follows:
1. The distance between every pair of bounding boxes present in two successive frames is computed as the distance between the positions of the bounding box centres.
2. The nearest neighbours among all pairs of bounding boxes in successive frames are identified. The nearest bounding boxes are assumed to represent the same signal light.
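The correspondence steps above amount to a nearest-neighbour search on box centres. The sketch below assumes boxes in (x1, y1, x2, y2) form and uses hypothetical function names; it performs a simple greedy match, so two boxes in the current frame may map to the same box in the previous frame, a case the text does not discuss.

```python
import math


def centre(box):
    """Centre point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def match_boxes(prev_boxes, curr_boxes):
    """Map each current-frame box index to its nearest previous-frame box index.

    Distance is the Euclidean distance between box centres.
    Returns a dict {current_index: previous_index}; empty if there is
    nothing to match against.
    """
    matches = {}
    for j, curr in enumerate(curr_boxes):
        cx, cy = centre(curr)
        best_index, best_distance = None, math.inf
        for i, prev in enumerate(prev_boxes):
            px, py = centre(prev)
            distance = math.hypot(cx - px, cy - py)
            if distance < best_distance:
                best_index, best_distance = i, distance
        if best_index is not None:
            matches[j] = best_index
    return matches


if __name__ == "__main__":
    previous = [(0, 0, 10, 10), (100, 0, 110, 10)]
    current = [(102, 1, 112, 11), (1, 1, 11, 11)]
    print(match_boxes(previous, current))
```

Because signals move only slightly between successive frames, the nearest centre is usually the same physical signal, which is the consistency assumption the method relies on.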
Because of consistency over frame sequences in a small time window, the above method was found to provide reasonably good performance. While the presence of background elements, obstacles, and multiple objects in the scene make the automated detection and recognition tasks particularly challenging, the successful application of the proposed method on images collected in field operation proves its effectiveness and its suitability for real-world sequences. Like most other object detection tasks in the transportation domain, the FRSign dataset has significant class imbalance. Only about 5% of the frames have a signal present in them. Instances of some of the signal classes, particularly those involving blinking lights, are relatively scarce in the dataset.

Railway Traffic Lights
The shapes of various traffic signals are shown in Fig. 5. The states of the traffic lights (and the associated traffic management information) for panels A, C, F, and H are shown in Table 1. Two example images from the dataset along with their annotations are shown in Fig. 6. A comprehensive description of the traffic light states with the list of possible combinations for the different panel shapes considered in this study can be found in Harb et al. (2020). The original dataset is large in size (approximately 289 GB). However, a large number of images are near duplicates, as shown in Fig. 7. We have performed semi-automatic image screening to obtain a distilled dataset containing only distinct images. Also, the dataset contains a number of wrongly annotated images, as shown in Fig. 8. We have cleaned such annotations by semi-automatic processes.
The perceptual hashing algorithm is used for detecting near duplicate images in the collection (Hamon et al. 2006). Perceptual hashing is a special type of locality-sensitive hashing and is effective in detecting image frames with slowly changing differences among them. We use the open source library pHash (Zauner 2010) to implement the near duplicate detection scheme. To clean the annotations, we identify all images whose classification differs from those of adjoining frames. The class annotations of these images were inspected and re-annotated to remove the anomalies.
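To illustrate the idea behind near duplicate screening, the sketch below uses a simple average hash compared by Hamming distance. This is a deliberately simplified stand-in and not the pHash library used in the study; the function names, the tiny grayscale grids, and the `max_distance` threshold are all assumptions made for illustration.

```python
def average_hash(gray):
    """Average hash of a small grayscale image given as a 2-D list.

    Each bit is 1 when the pixel is brighter than the image mean.
    The image is assumed to be already downsampled (e.g. to 8x8).
    """
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)


def hamming(hash_a, hash_b):
    """Number of bit positions in which two hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def screen_near_duplicates(frames, max_distance=2):
    """Split frame indices into kept frames and dropped near duplicates.

    A frame is dropped when its hash is within max_distance bits of the
    most recently kept frame (hypothetical threshold).
    """
    kept, dropped = [], []
    last_hash = None
    for index, gray in enumerate(frames):
        current = average_hash(gray)
        if last_hash is not None and hamming(current, last_hash) <= max_distance:
            dropped.append(index)
        else:
            kept.append(index)
            last_hash = current
    return kept, dropped


if __name__ == "__main__":
    a = [[10, 10], [200, 200]]
    b = [[12, 11], [198, 205]]  # slightly changed copy of a
    c = [[200, 200], [10, 10]]  # clearly different frame
    print(screen_near_duplicates([a, b, c]))
```

Small pixel-level changes between successive video frames leave the hash nearly unchanged, so such frames are screened out, while a genuinely different scene produces a distant hash and is kept.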
A distilled and cleaned dataset obtained from the original FRSign dataset is available online. There are a total of 2800 distinct images in the reduced and cleaned dataset. A higher accuracy is obtained by training on this reduced dataset than on the original dataset containing duplicate images and wrong annotations.

Results and Discussion
In this section, we present the results of our experiments on the distilled FRSign dataset of images. First, we discuss the evaluation metrics used in our study. Then, we present comparative results of the performance of various algorithms. Finally, we analyse the images on which the proposed algorithm fails and those on which it succeeds.

Evaluation Measures
The system initially predicts a bounding box in the image where a traffic signal is present. It then classifies the object within the bounding box into one of the signal classes. A NULL class denotes that no signal is present inside the bounding box. We also have a test set where the actual class of the signal state is known. The actual class is compared to the predicted class within the bounding box to evaluate the quality of the algorithm. Precision and recall are used as the evaluation measures for this purpose. For a signal state C, let TP_C be the number of test set images that are predicted to have a bounding box containing a signal with state C and that truly contain that state, FP_C the number of test set images predicted to have state C that do not actually contain it, and FN_C the number of test set images that actually contain state C but are not predicted as such. Precision (P) is defined as the ratio of TP_C to the number of test set images predicted to have a bounding box containing a signal with state C. This is defined in Eq. 2.
$$P = \frac{TP_C}{TP_C + FP_C} \quad (2)$$
Similarly, recall (R) is defined as the ratio of TP_C to the number of test images actually containing a signal with state C. This is defined in Eq. 3.
$$R = \frac{TP_C}{TP_C + FN_C} \quad (3)$$
Both precision and recall values are aggregated over all the classes as described subsequently. While precision measures how accurate the prediction system is, recall measures the coverage of the prediction system.
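The per-class precision and recall described above can be computed from paired lists of actual and predicted labels. The following is a minimal sketch; the function name and the example class names are hypothetical, and returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall(actual, predicted, cls):
    """Precision and recall for one signal class.

    actual, predicted : equal-length lists of class labels, one per test image
    cls               : the signal state C being evaluated
    """
    true_positive = sum(1 for a, p in zip(actual, predicted) if a == cls and p == cls)
    n_predicted = sum(1 for p in predicted if p == cls)  # TP + FP
    n_actual = sum(1 for a in actual if a == cls)        # TP + FN
    precision = true_positive / n_predicted if n_predicted else 0.0
    recall = true_positive / n_actual if n_actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical labels; "NULL" means no signal was assigned to the box
    actual = ["red", "red", "green", "NULL", "red"]
    predicted = ["red", "green", "green", "red", "NULL"]
    print(precision_recall(actual, predicted, "red"))
    print(precision_recall(actual, predicted, "green"))
```

In this toy example the class "red" is predicted twice but only once correctly, and it truly occurs three times, so precision is 1/2 and recall is 1/3.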
The values of precision and recall of the system may change if the intersection over union (IOU) threshold used for determining the class label is varied. Note that IOU denotes the ratio of the intersection area between a detected bounding box and the true bounding box of a signal to the area of their union. An IOU value below the threshold leads us to assign the box to the NULL class, signifying that no signal is detected. Only if the IOU value is above the threshold do we assign a class to the box. This inferred class may be correct or a misclassification. NULL assignments are also counted as misclassifications, thus affecting the P and R values. A threshold value leading to a high precision may result in a poor recall and vice versa. This is accounted for by considering the mean average precision (mAP), computed as follows. For a single class, the IOU threshold is varied over the interval [a, b] in small increments to obtain a range of precision and recall values. The area under the curve formed by precision and recall gives the average precision (AP@[a, b]). Finally, the average of AP@[a, b] over all the classes is defined as the mean average precision (mAP@[a, b]).
For the blinking signal classes, instead of bounding boxes in a single image frame, the classification is considered for a sequence of frames through which the signal blinks. While computing precision and recall values, such sequences along with their predicted class labels are counted instead of single frames. Let S_m = {X_1, X_2, …, X_m} be a sequence of m images detected to be part of a blinking signal sequence. We compute the majority class C into which the frames in this sequence are classified. We then treat the sequence S_m as a single unit classified as class C. Precision and recall values are calculated for these sequences in terms of the number of sequences S_m recognised and the number of true sequences in a particular class.
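The sequence-level labelling used for blinking classes can be sketched as a majority vote over per-frame predictions. The function name and the tie-breaking behaviour (the label seen first wins a tie) are assumptions; the text does not specify how ties are resolved.

```python
from collections import Counter


def sequence_class(frame_labels):
    """Majority class of a sequence of per-frame predicted labels.

    The whole sequence is then treated as one item of that class when
    counting precision and recall. Ties go to the label seen first
    (an assumed convention).
    """
    if not frame_labels:
        raise ValueError("the sequence must contain at least one frame")
    counts = Counter(frame_labels)
    return counts.most_common(1)[0][0]


if __name__ == "__main__":
    labels = ["feu rouge clignotant", "feu rouge",
              "feu rouge clignotant", "feu rouge clignotant"]
    print(sequence_class(labels))
```

Counting sequences rather than frames prevents one long blinking sequence from dominating the precision and recall figures for a scarce class.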
We consider 80% of the examples as the training set and remaining examples as the test set. Such random training set versus test set splits are performed 10 times and the average evaluation measures are reported.
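The evaluation protocol above (ten random 80/20 splits with averaged measures) can be sketched as follows. The function names and the fixed seed are assumptions introduced for reproducibility of the example; the source does not state how the random splits were generated.

```python
import random


def repeated_splits(n_examples, train_fraction=0.8, repeats=10, seed=0):
    """Generate repeated random train/test index splits.

    Returns a list of (train_indices, test_indices) pairs.
    The seed is a hypothetical choice made only for repeatability.
    """
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_examples))
    splits = []
    for _ in range(repeats):
        indices = list(range(n_examples))
        rng.shuffle(indices)
        splits.append((indices[:n_train], indices[n_train:]))
    return splits


def mean(values):
    """Average of the evaluation measure over all splits."""
    return sum(values) / len(values)


if __name__ == "__main__":
    splits = repeated_splits(2800)  # 2800 images in the distilled dataset
    train, test = splits[0]
    print(len(splits), len(train), len(test))
```

With the 2800 images of the distilled dataset, each split yields 2240 training images and 560 test images; the reported measure would be the `mean` of the per-split values.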

Results
We compare three algorithms with respect to the evaluation metrics described above, namely: YOLOv3 (Bochkovskiy et al. 2020), YOLOv5 (Jocher 2020) and EfficientDet (Tan et al. 2020). YOLOv3 is a faster version of YOLO but has a lower detection accuracy, whereas YOLOv5 has a higher accuracy but takes relatively more processing time. EfficientDet is a bi-directional feature pyramid based network with a scaling layer. It is known to provide good performance in resource constrained situations. For the FRSign dataset, the EfficientDet algorithm provides weaker performance. We also compare the computation time in milliseconds required by the algorithms to infer the class label for a single image frame. Computation time is reported for a standard virtual machine available in Google Colab with a Tesla T4 GPU and 16 GB of RAM. The CPU has 40 virtual cores with a clock speed of 1.39 GHz. The results are presented in Table 2. Note that evaluation measures for both blinking and non-blinking states are reported in the table. It is observed that YOLOv3 is faster but has poorer performance than the YOLOv5 algorithm. EfficientDet is significantly slower than both the YOLO algorithms.
For each image, the YOLOv5 algorithm requires 17-20 ms to provide the detection output; thus, at least 50 frames per second can be processed by the YOLOv5 algorithm for 1024×1024 images. This makes the system suitable for real-time operation.
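The frame rate quoted above follows directly from the per-frame latency. As a quick check, the sketch below converts the reported 17-20 ms range into frames per second; the function name is hypothetical.

```python
def frames_per_second(latency_ms):
    """Throughput implied by a per-frame processing time in milliseconds."""
    if latency_ms <= 0:
        raise ValueError("latency must be positive")
    return 1000.0 / latency_ms


if __name__ == "__main__":
    for latency in (17, 20):
        print(latency, "ms per frame ->", round(frames_per_second(latency), 1), "frames/s")
```

The slower bound of 20 ms gives 50 frames per second, and the faster bound of 17 ms gives roughly 58.8 frames per second.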
The time required by the blinking detection algorithm on some example sequences is presented in Table 3. It is observed that YOLOv5 combined with the blink detection algorithm can infer labels in real time.

Analysis of Results
We analyse the cases where the system wrongly classifies the signal states. The confidence of the system for all these erroneous predictions was observed to be low. The situations where the system might fail can be summarised as follows:
1. Extremely poor illumination conditions may lead to errors in detection. An example is shown in Fig. 10. Such situations can be handled using image pre-processing to improve brightness.
2. The small size of the signals in distant frames may also lead to errors in detection. This may be corrected by considering the consensus of the outputs over multiple frames, where more weight is given to closer images. An example is shown in Fig. 11.
3. The shapes of the signals seen from certain track curvatures may be misleading and lead to errors. An example is shown in Fig. 12. Again, this may be compensated by considering voting among predictions over multiple frames.
4. Cluttered scenes, particularly in railway yards and intersections, may lead to detection errors too. An example is shown in Fig. 13. This is by far the most difficult error to correct. Manual overrides may be employed in such situations.
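The multi-frame consensus suggested in items 2 and 3 could take the form of a weighted vote. The sketch below is one possible realisation and not part of the evaluated system; the function name is hypothetical, and using a per-frame weight such as the bounding box area as a proxy for closeness is an assumption.

```python
from collections import defaultdict


def weighted_vote(predictions):
    """Return the label with the largest total weight.

    predictions : list of (label, weight) pairs, one per frame, where a
                  larger weight marks a frame taken closer to the signal
                  (for example, a larger bounding box area; assumed proxy).
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    totals = defaultdict(float)
    for label, weight in predictions:
        totals[label] += weight
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Two distant frames disagree with one close frame
    frames = [("red", 1.0), ("red", 1.0), ("green", 5.0)]
    print(weighted_vote(frames))
```

With equal weights this reduces to a plain majority vote, which corresponds to the voting mentioned for the track curvature case.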
We also find that the system classifies the signals correctly in many difficult scenes. Notable among them are the following:
1. YOLOv5 works even in rainfall and cloudy conditions. This can be seen from the example images in Fig. 14.
2. The system is also found to work in scenes where other prominent light sources are visible. Light sources may include headlights of approaching trains, road traffic lights, etc. Some example images are shown in Fig. 15.
3. As shown in Fig. 16, the system works even in night conditions with reasonable visibility.
4. The system can also detect signals located far away from the train, ahead on the track. This is shown in Fig. 17.
5. Multiple signal lights present in a single frame have also been found to be correctly detected by the system. This is illustrated in Fig. 18.
Further analysis of the errors and success of the system is being carried out on a larger set of images from the dataset. It may be noted that the dataset is unbalanced in terms of the signal classes. This will be balanced in future with more images being processed.
Although the number of frames with blinking signal classes is relatively small, the blink processing heuristic is successful in recognising a majority of them. Figure 19 shows a sequence of four blinking frames that are detected correctly. Similarly, a sequence of four misclassified frames is shown in Fig. 20. The misclassification is owing to the failure of the blink detection heuristic in low light conditions.

Conclusions
We present a deep learning-based methodology for detection and classification of railway signals. The YOLOv5 architecture is used for this purpose. A dataset from the French Railways has been studied. We obtain high precision and recall for the detection task. The computational efficiency of the YOLOv5 architecture makes the method suitable for real-time application. We also propose a sequence based blink detection heuristic to detect blinking signal states. Deep neural architectures combining convolution, recurrent, and attention structures may also be used to handle such dynamic signal states.
The system has been found to be robust in night, rain and other adverse environmental conditions. It has also been found to perform satisfactorily in complex scenes containing several other light sources and various similar objects. It can also detect signals from a far distance, making it a useful supplementary system for human operators. The deep learning system trained using the French Railway dataset can be easily transferred to build signal detection systems for other railways with a minimum amount of additional training data. The architecture is also suitable for real-time on-board operation because of its very low computational time requirement on low cost computing platforms.
Funding
Open Access funding provided by the IReL Consortium.

Conflict of Interest
The authors declare that they have no conflict of interest with respect to the research, authorship, and/or publication of this article.
Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.