1 Introduction

Various types of Advanced Driver Assistance Systems (ADAS) have been developed to support the safe driving of private and public vehicles. In 2014 SAE International (the Society of Automotive Engineers) classified six levels of driving automation, from no automation (Level 0) to full automation (Level 5) [1]. Several studies on ADAS demonstrate that passive forward collision warning (FCW) and intelligent speed assistance (ISA) systems are effective in reducing the number of pedestrian/vehicle collisions [1,2,3]. ADAS can be defined as vehicle-based intelligent safety systems that can increase road safety in terms of crash avoidance, crash severity mitigation and user protection during the post-incident phase. The most well-known ADAS include the following technologies:

  • Collision Warning (i.e.: Blind Spot Warning; Forward Collision Warning; Lane Departure Warning; Parking Collision Warning; Rear Cross Traffic Warning);

  • Collision Intervention (i.e.: Automatic Emergency Braking; Automatic Emergency Steering; Reverse Automatic Emergency Braking (AEB));

  • Driving Control Assistance (i.e.: Adaptive Cruise Control (ACC); Lane Keeping Assistance; Active Driving Assistance);

  • Parking Assistance (i.e.: Active Parking Assistance; Remote Parking Assistance; Trailer Assistance);

  • other driver assistance systems (Automatic High Beams; Driver Monitoring; Head-Up Display; Night Vision).

Currently, ADAS are widely used in cars and trucks. Autonomous buses (also called self-driving or driverless buses and automated shuttles) are at the experimental stage in numerous cities worldwide thanks to their high automation levels [2,3,4]. Experiments have been conducted in controlled zones such as university campuses, parking areas and small villages [4]. Autonomous vehicles can be classified as follows: (1) private autonomous cars; (2) shared autonomous cars/taxis; (3) autonomous buses and trucks; (4) autonomous trams; (5) autonomous-rail rapid trams (ART) [4,5,6,7].

The International Association of Public Transport (UITP) relates the ATO, ATC and ATP systems to Grades of Automation (GoA). Each GoA is defined by allocating operational responsibility for the basic train functions either to an automatic system or to a human operator; on this basis, UITP identifies five GoA levels [8].

According to [9, 10], the grade of automation (GoA) of metro lines can be grouped into four categories (Fig. 1):

  • GoA1 (Operation with a driver). The tram driver is actively involved in all driving activities and the vehicle is equipped with an ATP (Automatic Train Protection) system;

  • GoA2 (Semi-automatic Train Operation, STO). The driver intervenes in driving only if a failure occurs and is responsible for opening and closing the doors. The tram is equipped with ATP and ATO (Automatic Train Operation) systems;

  • GoA3 (Driverless Train Operation, DTO). The tram does not require a driver. An attendant is responsible for opening and closing the doors and intervenes in the event of failures. The tram is equipped with ATP and ATO systems;

  • GoA4 (Unattended Train Operation, UTO). The tram requires neither a driver nor an attendant; it is equipped with ATP and ATO systems.

Fig. 1 Classification of grade of automation (adapted from [9, 10])

A similar classification can be applied to tramway systems. Consequently, tram operation can be considered automatic if the vehicles are driverless (GoA3 and GoA4).

Driverless systems applied to public transport may improve capacity, efficiency and safety while decreasing operational costs (lower personnel costs) and road congestion. In particular, autonomous buses and autonomous-rail rapid trams could adopt demand-driven schedules and therefore dynamically regulate their trajectory, capacity and stops according to user demand. Several studies have shown positive attitudes of passengers towards autonomous buses [4, 11,12,13]. Some autonomous vehicles, such as ARTs, use ACC to increase transportation system capacity: in operation, ACC allows small gaps between consecutive vehicles of a fleet.

In general, the main purpose of autonomous systems is to drive a vehicle safely without human supervision. For a long time, researchers have proposed artificial vision algorithms for implementing automatic systems in private or public vehicles, but many problems, including changes in obstacles, lighting, background and speed, turn this technical approach into a difficult task with no simple solution [14].

The main problem in ADAS is obstacle detection.

Nowadays several obstacle detection technologies are used in this field, based on sensors that provide information on the presence of pedestrians, cyclists and vehicles along the way. The most common detection systems can be classified as follows [15]: 1) active detection, which uses radar systems; 2) passive detection, based on visual detection technology (the vehicle is equipped with a CCD camera to obtain a large amount of information directly from the environment).

In the smart cities of the near future, there will be many types of autonomous vehicles, including trams. ADAS operating on trams will improve the safety of public transport systems and thus of the entire mobility system of the smart city.

Unlike ADAS operating on cars, ADAS in tramway systems need to identify and classify only the obstacles ahead, because the trajectories of tram vehicles are imposed by the rails. In addition, at each time instant the ADAS should compute the emergency braking curve (EBD), which describes the braking distance as a function of the instantaneous tram speed [16]. Over the years, a large number of accidents resulting in many fatalities and high socio-economic costs have occurred due to human-factor problems [17].
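As an illustration, a braking curve of this kind can be sketched with elementary kinematics; the deceleration and reaction-time values below are illustrative assumptions, not parameters of a real tram:

```python
# Minimal sketch of an emergency braking distance curve, assuming a constant
# deceleration and a fixed system reaction time (both values are hypothetical).
def emergency_braking_distance(v_kmh: float,
                               deceleration: float = 2.8,   # m/s^2 (assumed)
                               reaction_time: float = 1.0   # s (assumed)
                               ) -> float:
    """Braking distance [m] as a function of the instantaneous tram speed."""
    v = v_kmh / 3.6                          # convert km/h to m/s
    return v * reaction_time + v ** 2 / (2.0 * deceleration)

# Example: EBD at 50 km/h
print(round(emergency_braking_distance(50.0), 1))  # ~48.3 m
```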

After many decades of stagnation, traditional and innovative tramway systems (i.e. catenary-free systems) are currently experiencing a period of strengthening, with a notable expansion in new lines [18, 19]. In fact, since 2000 about 130 new tramway systems have been constructed worldwide and many existing systems have undergone deep renovations [1]. Modern trams enjoy great popularity because they combine low environmental impact, reasonable construction and maintenance costs, and high transport capacity. At the same time, the number of people using bicycles and other types of micro-mobility is increasing in cities, which causes a greater number of dangerous situations [20]. In urban contexts walking is a fundamental mode of transportation; nevertheless, more than 7,000 pedestrians are killed each year in the European Union, equal to 27% of all traffic victims [21]. According to [18, 22], some of the most important hazards in tram traffic are: 1) collisions between trams; 2) tram collisions with other types of vehicles, cyclists or pedestrians; 3) running into an obstacle on the tracks. In traditional tramway systems the tram driver must scan the approaching physical environment for hazards, warnings and imposed speed limits [23]. Reducing such hazards requires actions from both the transportation system (operators, drivers, infrastructure management, etc.) and road users.

The introduction of self-driving technologies could increase the quality of service, reliability and safety of these transportation systems [24], both in traditional and smart cities. A self-driving vehicle requires a multimodal suite of sensors (e.g. a dual-antenna GNSS-aided Inertial Navigation System (INS), lidar, radar and cameras) to perform localization, signal handling and obstacle handling tasks [25]. In this direction, Bombardier, Alstom, CRRC, Siemens and other companies have already carried out practical experiments [24, 25]. Nevertheless, the complex traffic environment and the correlated hazards make the realization of unmanned vehicles problematic in tramway systems, and much research and experimentation is still required, above all in the field of object recognition and tracking [26]. Presently, in-vehicle detection technology is widely applied in the smart car field and is divided into active detection and passive detection [6].

This research describes a technique for the recognition of pedestrians, vehicles and cyclists along a tramway infrastructure using computer vision and deep learning approaches. The proposed technique guarantees high accuracy and precision in object detection and could therefore be used in advanced driver assistance systems or in autonomous trams and autonomous-rail rapid trams (ARTs) to achieve a high level of safety. The paper is structured as follows. Section 2 presents an overview of the relevant literature on object detection and recognition systems based on deep learning techniques and on the YOLOv3 system. Section 3 briefly explains the proposed technique applied to a case study in an urban context. Section 4 deals with the neural network training; Sect. 5 presents the model for estimating the distance between pedestrians and rails, while the main results are presented in Sect. 6. Finally, conclusions are drawn in Sect. 7.

2 Methodology for the Detection of Road Users

The detection of pedestrians, cyclists and other road users plays a key role in emerging automatic driving technologies. In fact, errors in the detection of road users could threaten users' lives [27]. Therefore, the performance of road user detection algorithms is of great importance in the field of autonomous vehicles.

This research presents a methodology based on computer vision, deep learning and the YOLOv3 algorithm, applied to the recognition of pedestrians, vehicles and cyclists along a tramway infrastructure.

In accordance with the scientific literature, object detection algorithms are mainly subdivided into two classes: one-stage methods [8, 28, 29] and two-stage methods [30, 31]. One-stage methods classify the object and regress its location directly from the raw image. In two-stage methods, instead, the detection process requires two phases: 1) extraction of the regions of interest (ROIs) of an image, where the objects of interest may be; and 2) correction and recognition of the candidate regions [32]. YOLO is a typical one-stage detection algorithm, originally proposed by Redmon [8, 28] to achieve end-to-end target detection with a single CNN model. Object detection systems like YOLO (You Only Look Once), SSD (Single-Shot Detector) and Faster R-CNN not only classify images but can also locate and detect each object in images that contain multiple objects [33].

The structure of YOLO is composed of two parts [8]: the first is the feature extraction network, which extracts the general features of the object of interest [34]; the second is a post-processing network, which estimates the coordinates and categories of the objects under analysis. YOLOv3 is one of the latest versions of the YOLO networks: it does not require a region proposal network (RPN) and directly performs regression to detect targets in the image [35].

The YOLOv3 algorithm uses a new network for extracting object features: it includes 53 convolutional layers (Darknet-53) and 23 residual layers, as displayed in Fig. 2. YOLOv3 represents a remarkable advancement in real-time object detection, particularly for smaller objects, and uses multi-label classification. Therefore, in this research YOLOv3 is chosen as the detection system for vehicles, pedestrians and other road users along the analysed tramway line. As explained in [35], convolution kernels of three sizes (1 × 1, 3 × 3/2, and 3 × 3) are applied in the convolutional layers to sequentially extract image features, ensuring remarkable classification and detection performance.

Fig. 2 YOLOv3 network structure (adapted from [36])

The residual layers guarantee the convergence of the detection model [35]. To detect the parts of the object of interest at the same time, YOLOv3 fuses three feature maps of different scales (52 × 52, 26 × 26 and 13 × 13) obtained by three downsampling steps. In conclusion, the YOLO family (YOLOv1, YOLOv2, YOLOv3) is a series of end-to-end deep learning models designed for fast real-time object detection [33].
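As an illustration of how such a one-stage detector can be run on a video frame, the following sketch uses OpenCV's DNN module with pre-trained YOLOv3 weights; the file names, input size and thresholds are assumptions, not the exact configuration used in this research:

```python
# Hedged sketch of YOLOv3 inference via OpenCV DNN; paths are assumed.
import cv2
import numpy as np

net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")  # assumed files
out_names = net.getUnconnectedOutLayersNames()     # the three detection scales

frame = cv2.imread("frame.jpg")                    # assumed input frame
h, w = frame.shape[:2]
blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416), swapRB=True, crop=False)
net.setInput(blob)
outputs = net.forward(out_names)

boxes, confidences, class_ids = [], [], []
for output in outputs:                             # one output per scale
    for det in output:                             # det = [x, y, w, h, obj, classes...]
        scores = det[5:]
        class_id = int(np.argmax(scores))
        conf = float(det[4] * scores[class_id])
        if conf > 0.5:                             # assumed score threshold
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            confidences.append(conf)
            class_ids.append(class_id)

# Non-maximum suppression to discard duplicate boxes
keep = cv2.dnn.NMSBoxes(boxes, confidences, score_threshold=0.5, nms_threshold=0.4)
```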

In the YOLOv3 algorithm, the loss function is composed of three parts [37]: the classification loss, the localization loss (errors between the predicted bounding box and the ground truth) and the confidence loss (the objectness of the box), as follows:

  • Classification loss: if an object is detected, the classification loss at each cell is the squared error of the conditional class probabilities over all classes [37]:

    $$\sum\limits_{i=0}^{{S}^{2}}{I}_{i}^{obj}\sum\limits_{c\in \mathrm{classes}}{\left({p}_{i}\left(c\right)-{\widehat{p}}_{i}\left(c\right)\right)}^{2}$$
    (1)

    where

    \({I}_{i}^{obj}=1\) if an object appears in cell i, otherwise 0;

    \({\widehat{p}}_{i}\left(c\right)\) denotes the predicted conditional probability of class c in cell i.

  • Localization loss: evaluates the errors in the predicted bounding box locations and sizes; only the box responsible for detecting the object is counted [37]:

    $$\begin{array}{l}{\lambda }_{\text{coord}}\sum\limits_{i=0}^{{S}^{2}}\sum\limits_{j=0}^{B}{I}_{ij}^{obj}\left[{\left({x}_{i}-{\widehat{x}}_{i}\right)}^{2}+{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}\right]+\\ {\lambda }_{\text{coord}}\sum\limits_{i=0}^{{S}^{2}}\sum\limits_{j=0}^{B}{I}_{ij}^{obj}\left[{\left(\sqrt{{w}_{i}}-\sqrt{{\widehat{w}}_{i}}\right)}^{2}+{\left(\sqrt{{h}_{i}}-\sqrt{{\widehat{h}}_{i}}\right)}^{2}\right]\end{array}$$
    (2)

    where

    \({I}_{ij}^{obj}=1\) if the jth bounding box in cell i is responsible for detecting the object, otherwise 0;

    \({\lambda }_{\text{coord}}\) increases the weight of the loss on the bounding box coordinates.

  • Confidence loss: if an object is detected in the box, the confidence loss (measuring the objectness of the box) is [37]:

    $$\sum\limits_{i=0}^{{S}^{2}}\sum\limits_{j=0}^{B}{I}_{ij}^{obj}{\left({C}_{i}-{\widehat{C}}_{i}\right)}^{2}$$
    (3)

    where

    \({\widehat{C}}_{i}\): the predicted confidence score of box j in cell i;

    \({I}_{ij}^{obj}=1\): if the jth bounding box in cell i is responsible for detecting the object, otherwise 0.

    If an object is not detected in the box, the confidence loss is [37]:

    $${\lambda }_{\text{noobj}}\sum\limits_{i=0}^{{S}^{2}}\sum\limits_{j=0}^{B}{I}_{ij}^{\text{noobj}}{\left({C}_{i}-{\widehat{C}}_{i}\right)}^{2}$$
    (4)

    where \({I}_{ij}^{noobj}\) is the complement of \({I}_{ij}^{obj}\); \({\widehat{C}}_{i}\) is the predicted confidence score of box j in cell i; and \({\lambda }_{\text{noobj}}\) weights down the loss when detecting background.

The final loss adds the localization, confidence and classification losses together [37], as follows:

$$\begin{array}{l}{\lambda }_{\mathrm{coord}}\sum\limits_{i=0}^{{S}^{2}}\sum\limits_{j=0}^{B}{I}_{ij}^{obj}\left[{\left({x}_{i}-{\widehat{x}}_{i}\right)}^{2}+{\left({y}_{i}-{\widehat{y}}_{i}\right)}^{2}\right]\\ +{\lambda }_{\mathrm{coord}}\sum\limits_{i=0}^{{S}^{2}}\sum\limits_{j=0}^{B}{I}_{ij}^{obj}\left[{\left(\sqrt{{w}_{i}}-\sqrt{{\widehat{w}}_{i}}\right)}^{2}+{\left(\sqrt{{h}_{i}}-\sqrt{{\widehat{h}}_{i}}\right)}^{2}\right]+\\ \sum\limits_{i=0}^{{S}^{2}}\sum\limits_{j=0}^{B}{I}_{ij}^{obj}{\left({C}_{i}-{\widehat{C}}_{i}\right)}^{2}+{\lambda }_{\mathrm{noobj}}\sum\limits_{i=0}^{{S}^{2}}\sum\limits_{j=0}^{B}{I}_{ij}^{noobj}{\left({C}_{i}-{\widehat{C}}_{i}\right)}^{2}\\ +\sum\limits_{i=0}^{{S}^{2}}{I}_{i}^{obj}\sum\limits_{c\in \mathrm{classes}}{\left({p}_{i}\left(c\right)-{\widehat{p}}_{i}\left(c\right)\right)}^{2}\end{array}$$
(5)
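For clarity, the following didactic NumPy sketch evaluates the total loss of Eq. (5); the tensor layout is simplified (one probability vector per cell) and the λ values are the typical ones from the original YOLO formulation, not necessarily those used here:

```python
# Didactic sketch of Eq. (5); shapes and lambda values are assumptions.
import numpy as np

def yolo_loss(pred, gt, obj_mask, lambda_coord=5.0, lambda_noobj=0.5):
    """
    pred, gt: dicts of arrays
      'xy' (S*S, B, 2), 'wh' (S*S, B, 2, non-negative), 'conf' (S*S, B),
      'prob' (S*S, C).
    obj_mask: (S*S, B) -> 1 if box j of cell i is responsible for an object.
    """
    noobj_mask = 1.0 - obj_mask
    cell_mask = obj_mask.max(axis=1)               # I_i^obj per cell

    loc = lambda_coord * np.sum(
        obj_mask[..., None] * ((pred['xy'] - gt['xy']) ** 2
        + (np.sqrt(pred['wh']) - np.sqrt(gt['wh'])) ** 2))
    conf_obj = np.sum(obj_mask * (pred['conf'] - gt['conf']) ** 2)
    conf_noobj = lambda_noobj * np.sum(noobj_mask * (pred['conf'] - gt['conf']) ** 2)
    cls = np.sum(cell_mask[:, None] * (pred['prob'] - gt['prob']) ** 2)
    return loc + conf_obj + conf_noobj + cls
```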

In YOLOv3 the method for predicting the bounding box is given by Eq. (6):

$$\left\{\begin{array}{l}{b}_{x}=\sigma \left({t}_{x}\right)+{c}_{x}\\ {b}_{y}=\sigma \left({t}_{y}\right)+{c}_{y}\\ {b}_{w}={p}_{w}{e}^{{t}_{w}}\\ {b}_{h}={p}_{h}{e}^{{t}_{h}}\end{array}\right.$$
(6)

Here tx, ty, tw and th are the predicted outputs of the model, denoting the relative position of the bounding box center and the relative width and height of the box; cx and cy denote the offsets of the grid cell containing the box (from the top-left corner of the image); pw and ph are the width and height of the anchor (prior) box; and σ is the sigmoid function. Finally, bx, by, bw and bh are the resulting center coordinates, width and height of the predicted bounding box (Fig. 3).
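Equation (6) can be translated almost literally into code. The sketch below assumes offsets and anchor sizes expressed in grid-cell units, as in YOLOv3:

```python
# Minimal sketch of the bounding-box decoding of Eq. (6).
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def decode_box(t_x, t_y, t_w, t_h, c_x, c_y, p_w, p_h):
    """Return the center coordinates and size of the predicted box."""
    b_x = sigmoid(t_x) + c_x      # center x: cell offset + sigmoid shift
    b_y = sigmoid(t_y) + c_y      # center y
    b_w = p_w * np.exp(t_w)       # width: anchor scaled exponentially
    b_h = p_h * np.exp(t_h)       # height
    return b_x, b_y, b_w, b_h
```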

Fig. 3 Bounding box with dimension priors and location prediction (source: [8, 28])

The performance of an object detector is mainly measured by the following metrics:

Frames per second (FPS), which measures detection speed (number of images processed per second);

The precision-recall (PR) curve, in which precision and recall are calculated as follows:

$$\mathrm{Recall}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}}$$
(7)
$$\mathrm{Precision}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FP}}$$
(8)
$${\text{Accuracy}}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}}\cdot 100$$
(9)

The symbols TP, TN, FP and FN stand for True Positive, True Negative, False Positive and False Negative, respectively.
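Equations (7)-(9) reduce to a few lines of code; the following function is a direct implementation from the confusion-matrix counts:

```python
# Straightforward implementation of Eqs. (7)-(9).
def detection_metrics(tp: int, fp: int, fn: int, tn: int = 0):
    recall = tp / (tp + fn)                              # Eq. (7)
    precision = tp / (tp + fp)                           # Eq. (8)
    accuracy = 100.0 * (tp + tn) / (tp + tn + fp + fn)   # Eq. (9), in percent
    return recall, precision, accuracy
```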

3 The Case Study

The city of Palermo (Italy) covers a territory of 158.88 km² and has a population of 666,992 inhabitants. Apart from the capital, the metropolitan area includes 26 towns, with a total population of around 1 million inhabitants. The following tramway lines have been running since 2015:

  • Line 1 “Roccella”: about 5.5 km long, with double tracks;

  • Line 2 “Borgonuovo–Notarbartolo”: 4.8 km, with double tracks;

  • Line 3 “C.E.P. – Notarbartolo”: 5 km, with double tracks;

  • Line 4 “Notarbartolo-Calatafimi-Notarbartolo”: 8 km with single track.

The municipality expects to extend the tramway system with seven new lines called “A, B, C, D, E, F, G”, with a total of more than 68 km.

The experiments of this research were conducted on tramway Line 2 “Borgonuovo–Notarbartolo” in the city of Palermo (Italy), and more precisely in the three-arm roundabout of Fig. 4, which is crossed by different tramway sections. From the geometric point of view, the roundabout is characterized as follows: external diameter 24 m; width of the circulating carriageway 6 m; entry lane width 3.50 m; exit lane width 3.50 m. In the first phase of the research, deep learning and YOLOv3 algorithms were used to detect vehicles and pedestrians in the tramway track space (i.e. road users crossing or travelling along the track). For this purpose, several runs of a survey vehicle (Fig. 5) were made in the roundabout, following the tramway track. The detection processes were investigated via a video camera installed in the survey vehicle (Fig. 6a). Traffic video recordings, with a resolution of 1280 × 720 pixels, were analyzed on a workstation with an Intel(R) Core(TM) i7-4510 CPU @ 2.00 GHz (up to 2.60 GHz), 20 GB of RAM and Windows 10 Home.

The first research phase was camera calibration, which establishes two sets of parameters: intrinsic and extrinsic. The intrinsic parameters are related to the focal length and optical center; the extrinsic parameters (pitch angle, yaw angle and height above ground) determine the spatial offset of the camera. The Zhang algorithm [38] was used to determine the extrinsic parameters of the system. For the calibration process, 64 different images of the chessboard of Fig. 6b were considered, using a set of different photos. Figure 6c shows the visualization of the extrinsic parameters obtained by the calibration process. The calibrated model was then validated by several tests comparing estimated and measured values of the tramway gauge (see Sect. 5.1).
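The chessboard-based calibration step can be sketched as follows with OpenCV, whose calibrateCamera function implements Zhang's method; the image file pattern, board size and square size are illustrative assumptions, not the actual values of this experiment:

```python
# Hedged sketch of Zhang-style chessboard calibration, mirroring Fig. 6;
# the 9x6 inner-corner board and 25 mm squares are assumed values.
import glob
import cv2
import numpy as np

board = (9, 6)                                   # inner corners (assumed)
objp = np.zeros((board[0] * board[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:board[0], 0:board[1]].T.reshape(-1, 2) * 25.0  # mm

obj_points, img_points = [], []
for path in glob.glob("chessboard_*.png"):       # the 64 calibration images
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, board)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(objp)
        img_points.append(corners)

# Intrinsics (camera matrix K, distortion) and per-view extrinsics (rvecs, tvecs)
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```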

Fig. 4 Roundabout analyzed (tramway tracks in red)

Fig. 5 Hypothetical tram sensors and test vehicle used in the research

Fig. 6 Camera calibration procedure

4 Training of the Neural Networks

As explained in Sect. 2, YOLOv3 includes 53 convolutional layers, hence the name Darknet-53. During training with Darknet-53, the weights of the custom detector are saved every 100 iterations until iteration 1000, and then every 10,000 iterations until the maximum number of batches is reached [39]. The pre-existing road user dataset (including light and heavy vehicles, motorbikes, pedestrians and other road users) was split into 75% for training and 25% for testing.

During the training phase, data augmentation procedures (cropping, padding, flipping, etc.) were applied to enlarge and diversify the data used to train the large neural networks. Beforehand, a bounding-box labelling tool [40] was used to manually annotate the road users (vehicles and pedestrians) to be detected [39] in the tramway track space. The outcome of this phase consists, for each object, of a class label and the four coordinates of its bounding box; the tool converts these values into the YOLO label format that the YOLOv3 training algorithm employs. Figure 7 shows the training process, consisting of 2000 iterations. The accuracy (Eq. 9, Fig. 7), loss (Eq. 5, Fig. 8) and precision (Eq. 8) values demonstrate that the trained model detects pedestrians and vehicles with high accuracy.
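The label conversion mentioned above can be sketched as follows; the YOLO format stores, per object, the class index and the box center and size normalized by the image dimensions. Function and variable names are illustrative:

```python
# Minimal sketch of corner-style annotation -> YOLO label <class cx cy w h>.
def to_yolo_label(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    cx = (x_min + x_max) / 2.0 / img_w     # normalized box center
    cy = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w            # normalized box size
    h = (y_max - y_min) / img_h
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"

# Example: a pedestrian box on a 1280x720 frame
print(to_yolo_label(0, 600, 300, 660, 460, 1280, 720))
```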

Fig. 7 Accuracy of the pre-trained networks

Fig. 8 Loss function of the pre-trained networks

5 Estimation of the Distance Between Pedestrians and Rails

5.1 Tramway Track Detection

The detection and tracking of the rail boundaries are based on the algorithm developed by Nieto [41] for the detection of lane markings on roads and highways. In addition, a top-hat filter was used to correct the non-uniform brightness of the original image [42].

Let I be the original image under analysis; the top-hat transform can be written as follows:

$${\mathrm I}_{\mathrm t}=\left(\mathrm I-\mathrm I\odot\mathrm B\right)$$
(10)

where ⨀ denotes the morphological opening operator, B is the structuring element (a matrix of binary values, 0 or 1) and It is the filtered image. In this research the structuring element is a disk with a radius of 5 pixels. After determining the filtered image, the algorithm computes its complement; the difference is used as the pixel value of the output image.
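The filtering step of Eq. (10) can be sketched with OpenCV, whose MORPH_TOPHAT operation computes exactly I minus the opening of I; the input file name is an assumption:

```python
# Sketch of the top-hat filtering of Eq. (10): I_t = I - (I opened by B).
import cv2

img = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)         # assumed input
B = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (11, 11))  # disk, radius 5 px
I_t = cv2.morphologyEx(img, cv2.MORPH_TOPHAT, B)            # Eq. (10)
I_c = cv2.bitwise_not(I_t)                                  # complement of I_t
```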

The following types of elements are taken into account within any plane image: the pavement, the tramway rails and the objects (i.e. vehicles, pedestrians and other road users). Object segmentation is based on Bayesian decision theory under the conditions given below. Let S = {P, R, O, U} be the set of classes that characterize, respectively, the pavement, the rails of the tramway track, the objects (i.e. vehicles, pedestrians and other road users) and the unidentified elements. The main target of the classifier is to assign one of the classes {P, R, O, U} to each pixel of the image [42]. Following [41], let Xi denote the event that a pixel with coordinates (x, y) and associated observation vector zxy is classified as belonging to class i ∈ S. By means of Bayesian decision theory, the classification is obtained by selecting the class that maximizes the a posteriori conditional probability P(Xi|zxy), as follows:

$$\mathrm{P}\left({\mathrm{X}}_{\mathrm{i}}|{\mathrm{z}}_{\mathrm{xy}}\right)=\frac{\mathrm{p}\left({\mathrm{z}}_{\mathrm{xy}}|{\mathrm{X}}_{\mathrm{i}}\right)\mathrm{P}\left({\mathrm{X}}_{\mathrm{i}}\right)}{\mathrm{P}\left({\mathrm{z}}_{\mathrm{xy}}\right)}$$
(11)

where \(\mathrm{p}\left({\mathrm{z}}_{\mathrm{xy}}|{\mathrm{X}}_{\mathrm{i}}\right)\) is the likelihood of the observation zxy given class i, \(\mathrm{P}\left({\mathrm{X}}_{\mathrm{i}}\right)\) is the prior probability of each class, and \(\mathrm{P}\left({\mathrm{z}}_{\mathrm{xy}}\right)={\sum }_{\mathrm{i}\in \mathrm{S}}\mathrm{p}\left({\mathrm{z}}_{\mathrm{xy}}|{\mathrm{X}}_{\mathrm{i}}\right)\mathrm{P}\left({\mathrm{X}}_{\mathrm{i}}\right)\) is a scale factor that ensures the posteriors sum to unity. The probability that a pixel belongs to a certain class can then be estimated.

Then, adopting the likelihood models and an Expectation-Maximization (EM) algorithm, the segmented tramway rails are obtained as a Boolean image.

After the determination of the Boolean image of the tramway rails (Fig. 11d), the proposed procedure matches the estimated rail alignment against the real alignment. To fit the model to the real geometric alignment of the track, the RANSAC (RANdom SAmple Consensus) model [43] was implemented. RANSAC is an iterative technique for estimating the parameters of a mathematical model from a set of observed data containing a large proportion of outliers. In short, RANSAC is a resampling technique that produces realistic solutions using the minimum number of observations required to estimate the underlying model parameters [44]. In RANSAC the number of iterations N is selected so that, with probability p (usually set to 0.99), at least one of the random sample sets contains no outliers:

$$N=\frac{\mathrm{log}\left(1-p\right)}{\mathrm{log}\left(1-{v}^{m}\right)}$$
(12)

where v is the probability that any selected point is an inlier and m is the minimum number of points (selected independently) required to estimate the model. For a transition curve in the planimetric tramway alignment, the proposed algorithm estimates the coefficients [a, b, c, d] of the cubic y = a·x3 + b·x2 + c·x + d from the candidate coordinates (input data). This procedure also allows the camera calibration phase to be validated: the Euler angles of the camera can be estimated by comparing a sample of estimated tramway gauge values with the real value (1435 mm). Through an iterative procedure, the Euler angles are modified until the difference between the estimated and real values becomes less than 1 cm. Figure 9 summarizes the proposed process for camera calibration and validation.
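A minimal NumPy sketch of this RANSAC fitting is given below; the inlier probability, distance tolerance and random seed are illustrative assumptions:

```python
# Hedged sketch of RANSAC fitting of the cubic y = a*x^3 + b*x^2 + c*x + d;
# the iteration count follows Eq. (12).
import numpy as np

def ransac_cubic(x, y, p=0.99, v=0.7, tol=2.0):
    m = 4                                            # points needed for a cubic
    n_iter = int(np.ceil(np.log(1 - p) / np.log(1 - v ** m)))  # Eq. (12)
    best_coef, best_inliers = None, 0
    rng = np.random.default_rng(0)
    for _ in range(n_iter):
        idx = rng.choice(len(x), m, replace=False)   # minimal random sample
        coef = np.polyfit(x[idx], y[idx], 3)         # candidate [a, b, c, d]
        inliers = np.sum(np.abs(np.polyval(coef, x) - y) < tol)
        if inliers > best_inliers:                   # keep the best consensus
            best_coef, best_inliers = coef, inliers
    return best_coef
```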

Fig. 9 Camera calibration and validation schematic procedure

5.2 Top-View Image Transformation

At each time instant, the methodology determines the distance of road users from the rails, thus identifying the critical conditions that require emergency braking of the tram. The distance considered is that between the users' centroid projected onto the ground plane (road pavement surface) and the nearest rail [46]. The procedure for distance estimation, specified below, is based on inverse perspective mapping (IPM). Let {Fw} = {Xw, Yw, Zw} be the world frame centered at the camera optical center, {Fc} = {Xc, Yc, Zc} the camera frame and {Fi} = {u, v} the image frame (Fig. 10). It is assumed that the optical axis has no roll; in other words, the camera frame Xc axis stays in the world frame XwYw plane. The height of the camera frame with respect to the ground plane is h (Fig. 10).

Fig. 10 IPM coordinates. Left: world coordinate axes. Right: definition of pitch and yaw angles

Using the following homogeneous transformation, the projection onto the road plane of each image-plane point iP = {u, v, 1, 1} can be found [46]:

$${}_{\mathrm{i}}{}^{\mathrm{g}}\mathrm{T}=\mathrm{h}\left[\begin{array}{cccc}-\frac{1}{{\mathrm{f}}_{\mathrm{u}}}{\mathrm{c}}_{2}& -\frac{1}{{\mathrm{f}}_{\mathrm{v}}}{\mathrm{s}}_{1}{\mathrm{s}}_{2}& \frac{1}{{\mathrm{f}}_{\mathrm{u}}}{\mathrm{c}}_{\mathrm{u}}{\mathrm{c}}_{2}-\frac{1}{{\mathrm{f}}_{\mathrm{v}}}{\mathrm{c}}_{\mathrm{v}}{\mathrm{s}}_{1}{\mathrm{s}}_{2}-{\mathrm{c}}_{1}{\mathrm{s}}_{2}& 0\\ \frac{1}{{\mathrm{f}}_{\mathrm{u}}}{\mathrm{s}}_{2}& \frac{1}{{\mathrm{f}}_{\mathrm{v}}}{\mathrm{s}}_{1}{\mathrm{c}}_{2}& -\frac{1}{{\mathrm{f}}_{\mathrm{u}}}{\mathrm{c}}_{\mathrm{u}}{\mathrm{s}}_{2}-\frac{1}{{\mathrm{f}}_{\mathrm{v}}}{\mathrm{c}}_{\mathrm{v}}{\mathrm{s}}_{1}{\mathrm{c}}_{2}-{\mathrm{c}}_{1}{\mathrm{c}}_{2}& 0\\ 0& \frac{1}{{\mathrm{f}}_{\mathrm{v}}}{\mathrm{c}}_{1}& -\frac{1}{{\mathrm{f}}_{\mathrm{v}}}{\mathrm{c}}_{\mathrm{v}}{\mathrm{c}}_{1}+{\mathrm{s}}_{1}& 0\\ 0& -\frac{1}{{\mathrm{hf}}_{\mathrm{v}}}{\mathrm{c}}_{1}& \frac{1}{{\mathrm{hf}}_{\mathrm{v}}}{\mathrm{c}}_{\mathrm{v}}{\mathrm{c}}_{1}-\frac{1}{\mathrm{h}}{\mathrm{s}}_{1}& 0\end{array}\right]$$
(13)

where fu and fv are the horizontal and vertical focal lengths; cu and cv are the coordinates of the optical center; and c1 = cos α, c2 = cos β, s1 = sin α, s2 = sin β.

Conversely, from a point on the road plane gP = {xg, yg, −h, 1}, the subpixel coordinates in the image frame can be obtained as iP = \({}_{\mathrm{g}}{}^{\mathrm{i}}\mathrm{T}\) gP, using the inverse of the transform [46]:

$${}_{\mathrm{g}}{}^{\mathrm{i}}\mathrm{T}=\left[\begin{array}{cccc}{\mathrm{f}}_{\mathrm{u}}{\mathrm{c}}_{2}+{\mathrm{c}}_{\mathrm{u}}{\mathrm{c}}_{1}{\mathrm{s}}_{2}& {\mathrm{c}}_{\mathrm{u}}{\mathrm{c}}_{1}{\mathrm{c}}_{2}-{\mathrm{s}}_{2}{\mathrm{f}}_{\mathrm{u}}& -{\mathrm{c}}_{\mathrm{u}}{\mathrm{s}}_{1}& 0\\ {\mathrm{s}}_{2}\left({\mathrm{c}}_{\mathrm{v}}{\mathrm{c}}_{1}-{\mathrm{f}}_{\mathrm{v}}{\mathrm{s}}_{1}\right)& {\mathrm{c}}_{2}\left({\mathrm{c}}_{\mathrm{v}}{\mathrm{c}}_{1}-{\mathrm{f}}_{\mathrm{v}}{\mathrm{s}}_{1}\right)& -{\mathrm{f}}_{\mathrm{v}}{\mathrm{c}}_{1}-{\mathrm{c}}_{\mathrm{v}}{\mathrm{s}}_{1}& 0\\ {\mathrm{c}}_{1}{\mathrm{s}}_{2}& {\mathrm{c}}_{1}{\mathrm{c}}_{2}& -{\mathrm{s}}_{1}& 0\\ {\mathrm{c}}_{1}{\mathrm{s}}_{2}& {\mathrm{c}}_{1}{\mathrm{c}}_{2}& -{\mathrm{s}}_{1}& 0\end{array}\right]$$
(14)

With the previous relationships, the extrinsic parameters of the camera (\(Euler\ angles= \langle \alpha |\beta |\gamma \rangle =\langle pitch|yaw|roll\rangle\)) can be calculated using the estimateMonoCameraParameters function in the MATLAB environment.

It is worth pointing out that varying the pitch, yaw and roll angles in MATLAB changes the bird's-eye view of the analyzed image accordingly. The bird's-eye view of the images is of fundamental interest for analyzing the distance between pedestrians and tramway tracks. Figure 11 shows a schematic illustration of the top-view image transformation [45].
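A common practical shortcut to obtain such a top view, sketched below under assumed pixel coordinates, is to build a homography from four image-to-ground point correspondences rather than composing Eq. (13) explicitly:

```python
# Hedged sketch of a top-view (bird's-eye) transform via a homography;
# all pixel coordinates below are illustrative assumptions.
import cv2
import numpy as np

src = np.float32([[540, 460], [740, 460], [1180, 700], [100, 700]])  # image points
dst = np.float32([[300, 0], [980, 0], [980, 720], [300, 720]])       # top-view points
H = cv2.getPerspectiveTransform(src, dst)        # plane-to-plane homography
frame = cv2.imread("frame.jpg")                  # assumed input frame
top_view = cv2.warpPerspective(frame, H, (1280, 720))  # road-plane image
```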

Fig. 11 (a) Schematic illustration of the top-view image transformation (adapted from [45]); (b) original image of pedestrians near the tramway track; (c) road-plane transformation; (d) Boolean image of the tramway track

5.3 Road Users Tracking

In the tracking phase, a Kalman filter was implemented to track the projection onto the pavement surface of the centroid of each object of interest (i.e. pedestrians, vehicles and cyclists).

The Kalman filter [47] is a recursive predictive filter that estimates the state of a dynamic system. Here, a linear Kalman filter is applied to estimate the coordinates of the object of interest. The dynamic (prediction) equation is [48, 49]:

$${\mathrm{x}}_{\mathrm{n}+1}= {\mathrm{A}}_{\mathrm{n}}{\mathrm{x}}_{\mathrm{n}}+{\mathrm{B}}_{\mathrm{n}}{\mathrm{u}}_{\mathrm{n}}$$
(15)

with the propagation of the error covariance [48, 49]:

$${\mathrm{P}}_{\mathrm{n}+1}= {\mathrm{A}}_{\mathrm{n}}{\mathrm{P}}_{\mathrm{n}}{{\mathrm{A}}_{\mathrm{n}}}^{\mathrm{T}}+{\mathrm{Q}}_{\mathrm{n}}$$
(16)

in which xn is the state (the object coordinates) at step n, An is the state transition matrix, Bn is the input matrix, un is the input at step n, and Qn is the white-noise covariance [49]. This step is called the “prediction step” since it estimates the state at step n + 1. The Kalman gain is given by [49]:

$${\mathrm{K}}_{\mathrm{n}}= {\mathrm{P}}_{\mathrm{n}}{\mathrm{H}}^{\mathrm{T}}{\left({\mathrm{HP}}_{\mathrm{n}}{\mathrm{H}}^{\mathrm{T}}+ {\mathrm{R}}_{\mathrm{n}}\right)}^{-1}$$
(17)

where H is the measurement matrix (mapping the true state to the observation) and Rn is the measurement noise covariance.

After incorporating the actual measurement at the update time, the error covariance becomes [49]:

$${\mathrm{P}}_{\mathrm{n}}= \left(\mathrm{I}-{\mathrm{K}}_{\mathrm{n}}\mathrm{H}\right){\mathrm{P}}_{\mathrm{n}}$$
(18)

where I is the identity matrix and Kn the Kalman gain defined above.

The initial covariance matrix is a diagonal matrix with high values, since the centroids of the bounding boxes identifying the pedestrians are often not clearly distinguishable in many frames of the analyzed video sequences. The Kalman filter can be applied to different object motions (e.g. constant velocity and constant acceleration); the allowed deviation from the ideal motion model makes it possible to fit the real motion conditions.
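A minimal constant-velocity tracker implementing Eqs. (15)-(18) is sketched below; the frame rate and noise covariances are illustrative assumptions:

```python
# Constant-velocity Kalman tracker for a centroid (x, y); Eqs. (15)-(18).
import numpy as np

dt = 1 / 25.0                                    # frame interval (assumed 25 fps)
A = np.array([[1, 0, dt, 0], [0, 1, 0, dt],
              [0, 0, 1, 0], [0, 0, 0, 1]], float)    # state transition
H = np.array([[1, 0, 0, 0], [0, 1, 0, 0]], float)    # state -> measurement
Q = 1e-2 * np.eye(4)                             # process noise covariance (assumed)
R = 5.0 * np.eye(2)                              # measurement noise covariance (assumed)
x = np.zeros(4)                                  # state [x, y, vx, vy]
P = 1e3 * np.eye(4)                              # high initial covariance (cf. text)

def kalman_step(z):
    """One predict/update cycle for a measured centroid z = [x, y]."""
    global x, P
    x = A @ x                                    # Eq. (15), no control input
    P = A @ P @ A.T + Q                          # Eq. (16)
    K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)  # Eq. (17)
    x = x + K @ (z - H @ x)                      # measurement update
    P = (np.eye(4) - K @ H) @ P                  # Eq. (18)
    return x[:2]                                 # filtered centroid
```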

Figure 12 shows some examples of pedestrian and vehicle detection near the tramway track under study, while Fig. 13 gives an example of the Kalman filter applied to the evaluation of pedestrian trajectories. It is worth pointing out that the Kalman filter is able to estimate the trajectory of a pedestrian even when it overlaps with another object in the scene.

Fig. 12 Example of pedestrians and vehicles detection near the tramway track

Fig. 13 Results of the Kalman filter application

6 Results

Figure 11c shows an example of IPM: the position of a pedestrian with respect to the rails of the tramway track can be noted. The functions described in the previous sections therefore make it possible to estimate the distances between pedestrians and rails and to evaluate the risk conditions that require emergency braking maneuvers, even in situations where the pedestrian trajectories partially or totally occlude one another in the perspective view. As shown in Figs. 14 and 15, the procedure is capable of estimating the pedestrian speed component orthogonal to the rails as pedestrians approach or cross the tramway track (Fig. 14). Similar results can be obtained for vehicles, bicyclists, animals, etc.

Fig. 14 Examples of detection, tracking and distance measurement of two pedestrians

Fig. 15 Speed of two pedestrians (cf. Fig. 14): (a) pedestrian no. 1; (b) pedestrian no. 2

The results show the effectiveness of the proposed algorithm in detecting, with high accuracy and precision, the tramway track and the road users (pedestrians), together with their trajectories and speeds. This method could therefore be used in advanced driver-assistance systems (ADAS) or in autonomous trams and autonomous-rail rapid trams (ARTs) to achieve a high level of safety.

The proposed method is able to detect and follow the position of each road user using the tracking algorithm.

However, some false positives (FP) and false negatives (FN) were found during the detection phase of the algorithm [50, 51]. In general, detection errors are associated with a variety of causes, including the camera viewing angle, the oscillations of the test vehicle, other objects near the tramway line and many others [52]. Consequently, to better assess the reliability of the proposed technique, an error analysis was performed by comparing the number of detected road users with the actual number. Figure 16 summarizes the main results of the validation procedure: the correct detection rate ranges from 96.0% to 100%, depending on the road user type.

Fig. 16 Number of detected road users, total number of real road users and correct detection rate (A: cars; B: heavy vehicles; C: pedestrians; D: cyclists)

The tracking phase permits the detection of the same vehicle or pedestrian in all the positions it occupies in a sequence of consecutive video frames. The system successfully identified pedestrians and vehicles in lateral positions with respect to the rails.

Therefore, the speed component of the users orthogonal to the rails can be estimated, and a specific safety protection algorithm can activate the collision warning function.
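A simple way to support such a warning function, sketched below with hypothetical names, is to sample the fitted rail curve, compute the perpendicular distance of the tracked centroid and differentiate it over time:

```python
# Illustrative sketch: distance of a tracked centroid from the nearest rail
# (modeled as the fitted cubic) and its orthogonal speed, by finite differences.
import numpy as np

def distance_to_rail(pt, rail_coef, xs):
    """Min distance from pt = (x, y) to the rail curve y = poly(x), sampled at xs."""
    rail = np.column_stack([xs, np.polyval(rail_coef, xs)])
    return np.min(np.linalg.norm(rail - np.asarray(pt), axis=1))

def orthogonal_speed(d_prev, d_curr, dt):
    """Speed component toward the rail (> 0 means the user is approaching)."""
    return (d_prev - d_curr) / dt
```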

The implementation of this advanced deep learning-based detection method, together with a collision warning system in ADAS, may increase the safety of novel autonomous trams and autonomous-rail rapid trams (ARTs). For such novel transportation systems, a safety protection framework could be structured as shown in Fig. 17.

Fig. 17 Safety Protection Framework

7 Conclusions

Artificial intelligence and deep learning-based techniques are the future of Advanced Driver-Assistance Systems (ADAS). This article presents a technique for the real-time detection, tracking and recognition of pedestrians, vehicles and cyclists along a tramway infrastructure in a complex urban environment, using computer vision and deep learning approaches. In particular, the YOLOv3 algorithm, the RANSAC model and the Kalman filter were used.

Experimental activities were conducted on tramway Line 2 “Borgonuovo–Notarbartolo” in the city of Palermo (Italy), on the segment crossing a three-arm roundabout. A survey vehicle equipped with a video camera was used. Traffic video recordings, with a resolution of 1280 × 720 pixels, were analyzed on a workstation with an Intel(R) Core(TM) i7-4510 CPU @ 2.00 GHz (up to 2.60 GHz), 20 GB of RAM and Windows 10 Home.

The proposed method is able to locate private vehicles, pedestrians and cyclists near and over the rails in front of the tram very precisely, even though the detection and tracking of obstacles in a complex urban environment is complicated by the tram speed, motion blur and background changes, which can greatly reduce accuracy.

In fact, the accuracy, loss and precision values obtained with YOLOv3 during the neural network training process prove that the trained model detects the positions of several obstacles (vehicles, pedestrians and cyclists) simultaneously, with high accuracy in each frame. In particular, the pedestrian trajectories and the speed components orthogonal to the rails can be estimated. This type of information is useful for the safety protection algorithm and the collision warning function.

Although the proposed procedure needs to be validated with a greater number of tests, the first results demonstrate the effectiveness of the proposed algorithm, with overall good outcomes (minimum correct detection rate: 96%).

The proposed method thus guarantees high accuracy and precision in object detection and could be used in advanced driver-assistance systems or in autonomous trams and autonomous-rail rapid trams (ARTs) to achieve a high level of safety in both traditional and smart cities.