1 Introduction

Multiple object detection consists of detecting and identifying objects in scenes. In smart mobility applications, vehicles must have a better perception of their environment. This perception must guarantee not only good detection of the objects, but also a correct interpretation of the scenes. In multimodal road and railway smart mobility, the objects of interest to be detected are vehicles, pedestrians, buses, cyclists, trees, etc. While the road sector is widely covered in the scientific literature, this is far less the case for railway smart mobility. This is partly due to the lack of specific railway datasets with ground truth data dedicated to 3D object detection, which poses a real challenge since 3D object detection algorithms must be trained on datasets with ground truth data.

In the past few decades, and specifically since 2012, Deep Learning has become a very powerful tool because of its ability to handle large amounts of data. The use of deeper networks with more hidden and fully connected layers has surpassed traditional techniques in image processing and computer vision, especially in pattern recognition, object detection, and classification [1, 2]. One of the most popular families of deep neural networks is the CNN. With the rise of Deep Learning approaches for computer vision tasks and the emergence of CNNs, cameras can now be used for object detection, depth estimation, tracking, instance and semantic segmentation, etc. These methods have many applications, especially in smart mobility, where they enhance safety and increase vehicle autonomy. This need for environmental perception has led to the development of 3D detection methods, using either a LiDAR-type depth sensor or camera images, to improve safety. Although 3D object detection is a much sought-after field, few methods focus on the real-time aspect that is essential for real-world applications. In road and railway multimodal smart mobility, 3D multi-object detection must of course be performed in real time, which represents another challenge. In this paper, we address these two gaps, the lack of a railway dataset with ground truth data and the need for real-time 3D object detection, by introducing a new virtual dataset (GTAV) dedicated to both road and railway environments for 3D object detection, which includes images taken from the point of view of both cars and trains. With this dataset, we propose a new method dedicated to real-time 3D object detection based on the well-known 2D object detector YOLOv5 [3]. This method introduces 3D anchor boxes, which make it the fastest method available for 3D object detection with accuracy comparable to state-of-the-art methods. Using a model pre-trained on our GTAV dataset, our approach outperforms state-of-the-art methods. Through our experiments, we also demonstrate the validity of our method for real-time embedded applications for autonomous vehicles (cars and trains). The main contributions of this paper are the following:

  1.

    A new Lightweight CNN (L-CNN) method for real-time 3D object detection based on the YOLOv5 2D object detector. Our L-CNN predicts, for each object:

    • 2D detection

    • Distance from the camera

    • 3D centers projected on the image plane

    • 3D dimensions and orientation

  2.

    A new virtual dataset (GTAV) dedicated to both road and railway environments for 3D object detection. The GTAV dataset includes images taken from the point of view of both cars and trains.

The remainder of this paper is organized as follows. Section 2 reviews related work on autonomous navigation datasets and on 3D object detection approaches, covering both methods based on depth sensors and methods based on monocular images. In Sect. 3, we describe our multimodal road/railway GTAV dataset in more detail. Our new L-CNN model for 3D multi-class object detection is presented in Sect. 4. Experimental results and their analysis are described in Sect. 5. Finally, conclusions and future directions are outlined in Sect. 6.

2 Related work

2.1 Autonomous driving datasets

Any approach based on Deep Learning requires annotations (also called “ground truth”) for training. One of the main factors in the precision of such approaches is the quality and quantity of the data in the dataset. Among the datasets dedicated to autonomous vehicles, the vast majority are intended for autonomous driving in road environments and focus on object detection and scene segmentation. Some of them are mono-modal (in general, with images as the single modality), and others are multi-modal.

Although building mono-modal datasets is less time-consuming and expensive than building multimodal ones, the latter offer a crucial advantage for supervised artificial intelligence approaches because they provide a wider variety of ground truths from different sensors. Although these data are costly to acquire because they require expensive sensors such as LiDAR or radar, they are indispensable for environmental perception applications such as object detection/localization, distance estimation, and tracking. Many other road datasets exist for similar tasks, such as CityScape [4], PascalVOC [5], etc. The best-known road dataset is KITTI [6] (Karlsruhe Institute of Technology and Toyota Technological Institute), which includes rich ground truth from a distance sensor (LiDAR), a GPS/IMU unit, and stereoscopic cameras (color and grayscale). It comprises different streams of information captured and synchronized at 10 Hz. The vehicle is equipped with different sensors: cameras, a 3D Velodyne LiDAR (100k points per frame), and a 3D GPS/IMU unit (location, speed, acceleration). The cameras are calibrated and synchronized with the Velodyne LiDAR and with the GPS/IMU sensor [6]. KITTI includes 3D object track labels for different object classes (cars, trucks, trams, pedestrians, cyclists). It is a reference base for both training and evaluation of methods for stereo, optical flow, visual odometry, 2D object detection, 3D object detection, depth estimation, 2D tracking, etc. Among the road datasets, we can also find NuScenes [7], a public large-scale dataset for autonomous driving. NuScenes is more recent than the KITTI dataset, which inspired it. It includes 1000 driving scenes (of 20 s in length each) in Boston and Singapore, cities known for their dense traffic. It includes 23 object classes annotated with accurate 3D bounding boxes at 2 Hz, suitable for common autonomous driving tasks such as object detection, distance estimation, and tracking. NuScenes includes approximately 1.4 M camera images, 390 k LiDAR sweeps, 1.4 M radar sweeps, and 1.4 M object bounding boxes. As with KITTI, this dataset incorporates one LiDAR sensor, 5 radar sensors, 6 cameras providing 360\(^{\circ }\) coverage around the vehicle, and one GPS/IMU unit [7]. This dataset contains roughly 100x more images than KITTI and allows training methods for 2D and 3D object tracking, 2D and 3D object detection, and depth estimation.

The ROad event Awareness Dataset for Autonomous Driving (ROAD) [8] includes 22 videos with 122 K annotated video frames and 560 K detection bounding boxes carrying 1.7 M individual labels. ROAD was designed to test a robot-car’s situation awareness capabilities. The annotation process follows a multi-label approach in which road agents (vehicles, pedestrians, etc.) are annotated together with their locations and the actions they perform [8].

Many other multimodal large-scale autonomous vehicle datasets exist for similar tasks. Argoverse [9] images are captured under different weather conditions; the dataset includes annotations for RGB images, LiDAR/radar, and 3D boxes, and it includes more sensors than both KITTI [6] and NuScenes [7]. Argoverse provides 3D bounding boxes with tracking information for 15 objects of interest [9]. Another multimodal dataset, Lyft [10], uses cameras and LiDAR to provide 2D annotations and 3D bounding boxes for 4 object classes. All these datasets provide contextual knowledge through human annotation, which is very important for scene understanding applications. The Waymo open dataset [11] is a new large-scale, high-quality LiDAR and camera dataset that includes 1150 scenes (each 20 s long). Its sensors produce data annotated with 2D (camera image) and 3D (LiDAR) bounding boxes, with consistent identifiers across frames.

To our knowledge, there are only a few image datasets dedicated to the railway environment [12, 13]. However, these datasets are not dedicated to 3D detection and do not provide ground truth for depth estimation. They therefore cannot be used for the training or evaluation of 3D object detection approaches.

Finally, there are databases of virtual images created with simulators such as CARLA [14] or SYNTHIA [15]. The advantage of these databases is that they are simpler to acquire: the ground truth data are generated automatically by the simulator API, which allows the acquisition of data at scale. It also allows building a database adapted to specific needs, such as scene understanding in the railway environment. However, one of the disadvantages of this kind of dataset is that the graphics are dated, which makes it difficult to use models trained on them in real conditions. To overcome this problem, recent work has focused on acquiring databases from a video game [16]. Indeed, video games make graphical quality a priority, which makes them particularly interesting for this purpose. The video game GTAV has been used to acquire road databases similar to KITTI or CityScape. Although there are many virtual road datasets, to our knowledge, there is no similar state-of-the-art work for the railway.

2.2 3D object detection from monocular images

In recent years, many deep-learning-based methods have been proposed for 3D object detection from monocular images. Most of them use single RGB images, and a few use image sequences. Many of these methods are extensions of proven 2D detectors, thus leveraging their very good performance; 3D parameter estimation is then added to the models. Other methods infer 3D parameters directly through end-to-end learning.

In [17], the problem is tackled in two steps. First, candidate 3D boxes are generated and scored by exploiting several features, namely class semantics, instance semantics, contour, object shape, context, and location priors. The candidates with the best scores are then fed to a Fast R-CNN [18] to predict class labels, bounding-box offsets, and object orientation.

In [19], 2D detection is done by the multi-scale region-proposal network MS-CNN [20], extended to regress the orientation and the dimensions of the 3D bounding boxes. Orientation estimation uses the MultiBin principle, which turns angle estimation into a hybrid classification and regression task where the possible values are separated into bins representing the classes.

DeepMANTA [21] is a coarse-to-fine model for simultaneous vehicle detection, localization, visibility characterization, and 3D dimension estimation. It is based on a two-step process: the first step outputs bounding boxes associated with vehicle information, and the second one uses these outputs and a 3D vehicle virtual dataset to recover 3D orientations and locations. In [22], a depth map obtained by means of a fully convolutional network (FCN) is fused with the input RGB image before feeding a region-proposal network (RPN). It is also combined at different levels in the stream to finally estimate class labels, 2D box positions, and 3D box orientations, dimensions, and locations. For the orientation, the same MultiBin principle as in [19] is used.

In [23], a CNN model for joint 3D object detection and tracking is proposed. It is based on a Faster R-CNN that predicts the 2D bounding boxes, which are then associated with 3D information including position, orientation, dimensions, and the projected 3D box center of each object. Tracking across sequences of frames is obtained with two LSTM (Long Short-Term Memory) layers, allowing a further refinement of the 3D bounding box positions.

In SMOKE [24], a CNN is trained end-to-end in a single stage by means of a unified loss function. The parameters of the 3D bounding boxes are regressed directly, without 2D detection, thanks to a network made of two branches: a classification branch, where an object is encoded by its 3D center (projected on the image plane), and a 3D parameter regression branch that constructs the 3D bounding box around each center. In MonoGRNet [25], 3D object detection is decomposed into four sub-tasks: 2D object detection, object center depth estimation, projected 3D center estimation, and local corner regression. These four pieces of information are obtained in parallel through a single-pass process and then combined.

Anchor-based methods have also been proposed in M3D-RPN [26] and GAM3D [27]. M3D-RPN leverages depth-aware convolution to locate specific features and improve 3D scene understanding. The approach consists of a single stage using anchor boxes to predict both 2D and 3D object positions. GAM3D also uses 3D/2D anchor boxes to predict the objects’ 3D bounding boxes and introduces a ground-aware convolution module that provides additional 3D cues to the network, thus improving its accuracy.

In [28], 3D detection is reformulated as a keypoint detection problem. A keypoint feature pyramid network (KFPN) is defined, providing the center and the perspective points of the 3D bounding boxes. The lightweight backbone (such as ResNet-18 [29] or DLA-34 [30]) and the anchor-free nature of the detector allow real-time processing on a 1080Ti GPU.

Most of these models are dedicated to improving the accuracy of 3D object detection. A few others focus more on processing time, aiming for real-time performance on powerful GPUs. In our case, we focus on lightweight CNN architectures that make it possible to achieve real time on low-power GPUs such as the Jetson TX2. We believe that it is important to go further on this point to facilitate the large-scale deployment of deep-learning solutions for the safety of autonomous vehicles. Another important aspect of our needs is to perform well in the road environment as well as the railway environment. The lack of an existing railway dataset has led us to develop our own dataset, made of virtual images from the GTA-V game. We use this custom dataset as a complement to KITTI. In addition to our virtual railway dataset, we are developing a real railway dataset with ground truth, allowing us to go further in the development of railway e-Advanced Driver Assistance Systems (e-ADAS).

3 Multimodal road/rail dataset

3.1 GTAV virtual multi-modal dataset presentation

To compensate for the lack of a real dataset for the railway environment, we turn to a virtual dataset. The main advantage of a virtual dataset is the simplification of its acquisition and annotation. However, this comes at the cost of fidelity to reality. Indeed, most datasets generated with simulators (such as CARLA) do not have sufficient graphical fidelity to allow the methods trained on them to perform well in real-world conditions. To overcome this problem, we turn to video games to acquire a railway database for 3D object detection. Where simulators lack graphical fidelity, video games strive to offer the best graphical fidelity to give users a feeling of authenticity. This is why several works have already extracted databases from games. One of the most interesting games is Grand Theft Auto V, which allows the user to drive road vehicles and presents realistic road scenes with, for example, level crossings, crosswalks, heavy traffic, etc. A lot of work has been done in this direction to build databases for the road environment [16, 23]. Our railway database was acquired with a data acquisition pipeline based on the work of [31]. To do this, we used a modified version of the game's code [32] that allows the user to drive a subway or a train, enabling us to build a hybrid database with both road and rail scenes that can then be used for the training and evaluation of our 3D detection approach.

3.2 GTAV virtual multi-modal dataset architecture

The ground truth is generated automatically by the game’s API and includes, among other things, the 3D and 2D bounding boxes, the object classes, the depth map, and the semantic map. All these annotations are taken from the point of view of a stereoscopic camera with an acquisition frequency of 10 Hz. Some statistics for our dataset can be found in Fig. 1.

Fig. 1
figure 1

Statistics of our new dataset: top left, the image daytime distribution; top right, the weather condition distribution; bottom left, the road/railway distribution; bottom right, the class distribution

During the acquisition of the dataset, several classes of objects were annotated. However, to combine our virtual dataset with the 3D detection KITTI dataset, we selected only the classes that are also present in KITTI (Car, Person, Cyclist). Our dataset is composed of roughly 140 K samples from the stereoscopic camera. Each of these samples has a depth map and a ground truth for 3D detection (position in camera coordinates, dimensions, orientation, etc.). Image samples from our dataset can be found in Fig. 2. Our dataset also offers various situations and environmental conditions, resulting in a challenging dataset for computer vision tasks. It includes night, day, rainy, and clear weather scenes, offering a wide variety of visibility conditions.

Fig. 2
figure 2

Images with 3D bounding box ground truths. Our dataset presents a variety of environments and conditions. From top left to bottom right: road with clear weather, road at night, road with rainy weather, and rail with clear weather. The ground truths shown are for the same classes as KITTI: Car, Person, and Cycle

4 3D multi-class object detection

Our method is a single-stage 3D object detector based on the YOLOv5 2D detection network. YOLOv5 is a light convolutional network that offers good accuracy, allowing real-time use even on embedded systems such as the Nvidia Jetson TX2. It offers a significant performance gain compared to previous YOLO versions thanks to an improved network architecture as well as innovations in data augmentation during training. The single-stage design of our network allows for better computation times than multiple-stage methods [19, 21, 23], and it can be trained end-to-end. Another problem posed by multiple-stage methods is the degradation in accuracy of the 3D parameter regression when the predicted RoIs used for feature alignment differ from the ones used in training [33]. Our single-stage method uses hybrid anchor boxes to directly regress the 2D bounding box as well as the 3D parameters; we thus avoid the accuracy fluctuations that occur when the RoIs are not fixed. This new approach is the fastest method for 3D prediction from images without sacrificing accuracy. Figure 3 shows an overview of our 3D multi-object detection method.

Fig. 3
figure 3

Overview of our method for 3D multi-object detection. A single RGB image is used as input. We leverage our Hybrid anchor boxes to predict both the 2D and the 3D bounding boxes. Non-Maximum-Suppression is used to filter the predictions. Among the 3D parameters that we predict, we have the projected 3D center on the image plane, the distance of the object, its dimensions, and finally its orientation

4.1 3D bounding box estimation

The 3D bounding box of an object can be described by its center position from the camera \(\mathbf {T}=[x~y~z]^{\top }\), its dimensions \(\mathbf {D}=[w,h,l]\) (with w, h, and l the width, height, and length, respectively) and its orientation \(\mathbf {R}(\phi ,\theta ,\psi )\) (with \(\phi \), \(\theta \), and \(\psi \) the elevation, azimuth, and roll angles, respectively). Given \(\mathbf {K}\) the matrix of intrinsic parameters of the camera and \(\mathbf {X_{o}}=[x_{o}~y_{o}~z_{o}~1]^{\top }\) a 3D point in the object coordinate system, the re-projection of this point into the image plane \(\mathbf {x_{im}}=[u~v~1]^{\top }\) is given by:

$$\begin{aligned} \mathbf {x_{im}}=\mathbf {K}.\begin{bmatrix} \mathbf {R}&\mathbf {T} \end{bmatrix}.\mathbf {X_{o}}, \end{aligned}$$
(1)

where \([\mathbf {R}~\mathbf {T}]\) is the extrinsic matrix combining the rotation \(\mathbf {R}\) (\(3 \times 3\)) and the translation \(\mathbf {T}\) (\(3 \times 1\)). By considering the center of the 3D bounding box as the origin of the object coordinate system, the corners of the 3D bounding box are: \(\mathbf {X_{1}}=\begin{bmatrix} w/2&h/2&l/2 \end{bmatrix}\), \(\mathbf {X_{2}}=\begin{bmatrix} w/2&-h/2&l/2 \end{bmatrix}\), ..., \(\mathbf {X_{8}}=\begin{bmatrix} -w/2&-h/2&-l/2 \end{bmatrix}\). The 3D bounding box coordinates in the image can then be obtained using Eq. (1).
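
As an illustration, the following minimal NumPy sketch applies Eq. (1) to the eight corners defined above, assuming a rotation about the vertical axis only (\(\phi =\psi =0\), see Sect. 4.2.5). The function name, the corner ordering, and the rotation convention are illustrative and do not come from our implementation.

```python
import numpy as np

def project_box_corners(K, T, yaw, dims):
    """K: 3x3 intrinsics, T: (x, y, z) box center in the camera frame,
    yaw: azimuth theta in radians, dims: (w, h, l) in meters."""
    w, h, l = dims
    # 8 corners in the object coordinate system (box center at the origin)
    x = np.array([ w,  w,  w,  w, -w, -w, -w, -w]) / 2.0
    y = np.array([ h,  h, -h, -h,  h,  h, -h, -h]) / 2.0
    z = np.array([ l, -l,  l, -l,  l, -l,  l, -l]) / 2.0
    corners = np.vstack((x, y, z))                   # 3 x 8 matrix of corners

    # Rotation about the vertical axis only (phi = psi = 0)
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[ c, 0, s],
                  [ 0, 1, 0],
                  [-s, 0, c]])

    cam = R @ corners + np.asarray(T).reshape(3, 1)  # [R T] . X_o (Eq. (1))
    uvw = K @ cam                                    # homogeneous pixel coords
    return uvw[:2] / uvw[2]                          # (u, v) for the 8 corners
```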

4.2 Parameters to regress

To better compare the performance of multi-stage and single-stage approaches for 3D object detection, we predict the same parameters as in our previous multi-stage approach [33].

4.2.1 2D object detection

In our work, we use YOLOv5 to perform 2D object detection. YOLOv5 performs the object classification cls as well as the bounding box position regression b. The bounding box b is then used as prior information for predicting the object center, and the object class cls is used in the object dimension prediction.

4.2.2 Object center prediction

In this work, we assume that the 3D center of the object is the center of its 3D bounding box. Predicting the center of the 3D bounding box is, therefore, equivalent to predicting the object position \(\mathbf {X_{o}}=[x_{o}~y_{o}~z_{o}~1]^{\top }\). Instead of directly predicting the coordinates of the 3D object center on the image plane, we use cues from the 2D bounding box estimation: we predict the offset, in pixels, of the projected object center from the 2D bounding box center, \(\tilde{\mathbf {c}}=\begin{bmatrix} {\tilde{cx}}&{\tilde{cy}} \end{bmatrix}\). The variance of the prediction is thus reduced, making it easier to learn. The object center \(\mathbf {X_{o}}\) is then computed using the object distance estimate on the Z-axis and the inverted calibration matrix \(\mathbf {K^{-1}}\).
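
For illustration, a minimal sketch of how the 3D center can be recovered from the predicted quantities is given below; the variable names (2D box center, pixel offset, predicted distance) are illustrative and not part of our implementation.

```python
import numpy as np

def recover_3d_center(K, box_center, center_offset, z):
    """box_center: (cx, cy) of the 2D box in pixels, center_offset: predicted
    (dx, dy) offset in pixels, z: predicted distance in meters on the Z-axis."""
    u = box_center[0] + center_offset[0]         # projected 3D center (u, v)
    v = box_center[1] + center_offset[1]
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # back-projected viewing ray
    return z * ray / ray[2]                      # (x_o, y_o, z_o) with z_o = z
```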

4.2.3 Object distance estimation

To obtain the center of the 3D bounding box, it is necessary to get its position on the Z-axis of the camera coordinates. For each RoI from the object detector, we predict the object’s center distance \({\tilde{z}}\) in meters.

4.2.4 Object dimensions

Instead of directly predicting the object’s dimensions, we exploit the fact that these dimensions have a very low variance within the same class (car, truck, etc.). We therefore use the average dimensions of each object class as a strong prior and regress only an offset from them.

4.2.5 Object orientation

In our work, we assume that only the azimuth (noted \(\theta \)) matters for road applications, and thus we do not predict the elevation and roll: we set \(\phi =0\) and \(\psi =0\). Since the same azimuth can lead to multiple observed orientations from the camera point of view (see Fig. 4), we cannot directly predict the angle \(\theta \). Instead, we predict the observed angle \(\alpha \) and retrieve the global orientation \(\theta \) using Eq. (2):

$$\begin{aligned} \theta = \alpha +\arctan \left( \frac{x}{z}\right) . \end{aligned}$$
(2)

Following the work described in [19], instead of considering the angle prediction as a pure regression task, we formulate it as a hybrid classification/regression problem in which the possible angles are separated into two bins centered on \(90^\circ \) for the first one and \(-90^\circ \) for the second one. We then perform a classification to predict in which bin the object angle is located, and we regress the difference between the bin center and \(\alpha \).
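
A minimal sketch of this decoding step is given below, assuming the two bin centers stated above; the function and variable names are illustrative only, and arctan2 is used as a numerically robust equivalent of \(\arctan (x/z)\) in Eq. (2) for \(z>0\).

```python
import numpy as np

BIN_CENTERS = np.deg2rad([90.0, -90.0])          # bin centers from the text

def decode_orientation(bin_scores, sin_cos, x, z):
    """bin_scores: 2 classification scores, sin_cos: (sin, cos) offset per bin,
    x, z: predicted object center coordinates in the camera frame (meters)."""
    b = int(np.argmax(bin_scores))               # which bin alpha falls into
    sin_o, cos_o = sin_cos[b]
    alpha = BIN_CENTERS[b] + np.arctan2(sin_o, cos_o)   # observed angle
    theta = alpha + np.arctan2(x, z)             # Eq. (2): global azimuth
    return theta
```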

Fig. 4
figure 4

Illustration of the object azimuth \(\theta \) and its observed orientation \(\alpha \). The local orientation is retrieved by computing the angle between the X-axis of the camera and the normal to the ray joining the camera and the object center. Given that we use left-handed coordinates, the rotation is clockwise. Our method estimates the observed orientation, and \(\theta \) is obtained using Eq. (2)

4.3 Hybrid anchor boxes for 3D detection

To take advantage of the innovations brought by single-stage approaches, we have modified the YOLOv5 anchor boxes into hybrid anchor boxes that perform the regression of both the 2D bounding boxes and the 3D parameters. We predict three grids of features at different scales. In each grid cell, three anchor boxes are predicted. These boxes contain the predictions of the object classes, the position of the bounding box relative to the grid cell center, the probability that an object is present, and finally the 3D bounding box parameters (projected center, distance, dimensions, orientation). The output of our hybrid anchor boxes is detailed in Eq. (3). Let \(\mathbf {y}\) be the output vector of one of our new anchor boxes:

$$\begin{aligned} \mathbf {y}= \begin{bmatrix} \mathbf {b_{off}}&obj&\mathbf {cls}&\mathbf {c_{off}}&Z&\mathbf {D}&\mathbf {O} \end{bmatrix}. \end{aligned}$$
(3)

with \(\mathbf {b_{off}}=\begin{bmatrix}x&y&w&h\end{bmatrix}\) the 2D bounding box offset prediction, obj the objectness prediction, \(\mathbf {cls}=\begin{bmatrix}cls_1&...&cls_N\end{bmatrix}\) the scores for the N classes, \(\mathbf {c_{off}}=\begin{bmatrix}x_c&y_c\end{bmatrix}\) the predicted offset, in pixels, between the projected 3D center and the center of the bounding box, Z the position of the 3D object center on the Z-axis in meters, and \(\mathbf {D}=\begin{bmatrix}W_1&H_1&L_1&...&W_N&H_N&L_N\end{bmatrix}\) the predicted dimension offsets in meters for each of the N classes; the predicted offset is the difference between the object’s dimensions and its class mean dimensions. Finally, the orientation prediction is \(\mathbf {O}=\begin{bmatrix}bin_{1}&\overline{bin_{1}}&sin_1&cos_1&bin_{2}&\overline{bin_{2}}&sin_2&cos_2\end{bmatrix}\), where \(bin_1\) and \(bin_2\) represent the predicted probabilities that \(\alpha \in \left[ -195^\circ , 15^\circ \right] \) and \(\alpha \in \left[ -15^\circ , 105^\circ \right] \), respectively, with \(\alpha \) the observed orientation of the object, and \(sin_{i\in \left\{ 1,2 \right\} }\), \(cos_{i\in \left\{ 1,2 \right\} }\) the sine and cosine of the observed angle offset with respect to the center of the ith bin. Our bin centers are \(90^\circ \) and \(-90^\circ \), respectively. By adding the prediction of the 3D parameters inside the anchor boxes, our method learns to predict the distance of the objects, the position of their centroid, their dimensions, and their azimuth. With these parameters, we are then able to draw the 3D bounding boxes of the objects. All the anchor boxes of the different grids are then merged and re-projected onto the image, and a Non-Maximum-Suppression function is used to filter them. This approach allows our method to do 3D bounding box prediction in a single stage, thus reducing the size of our network without sacrificing accuracy. Another notable advantage over multiple-stage 3D detection methods is that it avoids the deterioration of 3D detection accuracy when the predicted RoIs vary, especially when aligning features for 3D regression.
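
The following sketch illustrates how one anchor output vector \(\mathbf {y}\) of Eq. (3) can be sliced into its 2D and 3D components. The layout follows the text; the activation functions applied to each slice and the exact variable names are implementation details and assumptions, not a verbatim copy of our code.

```python
import numpy as np

def split_anchor_output(y, num_classes, class_mean_dims):
    """y: one anchor output vector (1-D array, Eq. (3));
    class_mean_dims: (N, 3) array of per-class average (W, H, L) in meters,
    computed from the training annotations (hypothetical constant)."""
    n = num_classes
    b_off  = y[0:4]                              # 2D box offsets (x, y, w, h)
    obj    = y[4]                                # objectness score
    cls    = y[5:5 + n]                          # per-class scores
    c_off  = y[5 + n:7 + n]                      # projected-3D-center offset (px)
    z      = y[7 + n]                            # distance on the Z-axis (m)
    d_off  = y[8 + n:8 + 4 * n].reshape(n, 3)    # (W, H, L) offsets per class
    orient = y[8 + 4 * n:16 + 4 * n]             # 2 x (bin, bin_bar, sin, cos)

    k = int(np.argmax(cls))                      # predicted class index
    dims = class_mean_dims[k] + d_off[k]         # offsets added to class means
    return b_off, obj, cls, c_off, z, dims, orient
```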

4.4 Losses

In this subsection, we present the loss calculation of our method. The loss is specified in Eq. (4):

$$\begin{aligned} L=L_{yolo}+k_1.L_{center}+k_2.L_{distance}+k_3.L_{dim}+k_4.L_{orient}, \end{aligned}$$
(4)

where \(\mathbf {L_{yolo}}\) is the loss for 2D detection and class prediction, and \(\mathbf {L_{center}}\), \(\mathbf {L_{distance}}\), \(\mathbf {L_{dim}}\), and \(\mathbf {L_{orient}}\) are the losses for the 3D object center, distance, dimensions, and orientation, respectively. \(k_{i\in [1,...,4]}\) are the weights of the different losses. For \(\mathbf {L_{yolo}}\), we use the same loss as YOLOv5. With \(\mathbf {c_k}=\begin{bmatrix} cx^k&cy^k \end{bmatrix}\) the projection on the image plane of the ground-truth 3D center of object k in pixels, \(\mathbf {Rc_{k}}=\begin{bmatrix} Rcx_{k}&Rcy_{k} \end{bmatrix}\) the center of the corresponding RoI, N the number of objects, and \(\tilde{\mathbf {c}}_{\mathbf {k}}\) the prediction from our network, the center loss is defined by Eq. (5):

$$\begin{aligned} L_{center}= Mean(\frac{1}{N} \sum _{k=1}^N \left| \mathbf {c_k}-\mathbf {Rc_{k}}-\tilde{\mathbf {c}}_{\mathbf {k}} \right| ). \end{aligned}$$
(5)

Assuming \(z_k\) the ground-truth distance of an object k, N the total number of objects and \({\tilde{z}}_k\) the distance prediction from our method, the distance loss is then calculated using the L1 loss and is detailed in Eq. (6):

$$\begin{aligned} L_{distance}= \frac{1}{N} \sum _{k=1}^N \left| z_k-{\tilde{z}}_k \right| . \end{aligned}$$
(6)

With \(\mathbf {d_{k}}=\begin{bmatrix}\mathrm{d}x&\mathrm{d}y&\mathrm{d}z \end{bmatrix}\) the ground-truth dimensions of an object k in meters, \(\mathbf {dd_k}\) the mean dimensions for the class of object k, N the number of objects, and \(\tilde{\mathbf {d}}_{\mathbf {k}}\) the prediction from our method, the dimension loss is detailed in Eq. (7):

$$\begin{aligned} L_{dim}= Mean(\frac{1}{N} \sum _{k=1}^N \left| \mathbf {d_k}-\tilde{\mathbf {d}}_{\mathbf {k}} - \mathbf {dd_k}\right| ). \end{aligned}$$
(7)

Finally, assuming \(\alpha _k\) the ground-truth observed angle of object k, N the total number of objects and \(\tilde{\alpha }_k\) the prediction, we use the smooth L1 loss as the orientation loss. The loss is written in Eq. (8) with \(\beta =1\):

$$\begin{aligned} L_{orient}= \frac{1}{N} \sum _{k=1}^N e_{k}, \end{aligned}$$
(8)

where

$$\begin{aligned} e_{k} ={\left\{ \begin{array}{ll}0.5 (\alpha _k - \tilde{\alpha }_k)^2 / \beta , &{} \text {if } |\alpha _k - \tilde{\alpha }_k| < \beta \\ |\alpha _k - \tilde{\alpha }_k| - 0.5 \times \beta , &{} \text {otherwise }\end{array}\right. }. \end{aligned}$$
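
As an illustration, a hedged PyTorch sketch of the combined loss of Eqs. (4)-(8) is given below, using the loss weights reported later in Sect. 5.1.7 and the \(L_{yolo}<0.1\) condition described there. Tensor shapes, dictionary keys, and the assumption that the center and dimension targets are already expressed as offsets (relative to the RoI center and to the class mean dimensions, as in Eqs. (5) and (7)) are ours for illustration only.

```python
import torch
import torch.nn.functional as F

def total_loss(pred, target, yolo_loss, k=(0.005, 0.2, 0.0176, 0.01)):
    """pred/target: dicts of tensors 'center' (N,2), 'dist' (N,), 'dim' (N,3),
    'alpha' (N,); center/dim targets are offsets as in Eqs. (5) and (7).
    yolo_loss: the unchanged YOLOv5 2D detection loss (scalar tensor)."""
    k1, k2, k3, k4 = k
    l_center = F.l1_loss(pred['center'], target['center'])          # Eq. (5)
    l_dist   = F.l1_loss(pred['dist'],   target['dist'])            # Eq. (6)
    l_dim    = F.l1_loss(pred['dim'],    target['dim'])             # Eq. (7)
    l_orient = F.smooth_l1_loss(pred['alpha'], target['alpha'],
                                beta=1.0)                           # Eq. (8)
    # Condition of Sect. 5.1.7: once the 2D loss is small enough, it is
    # left out of the optimization to avoid overfitting 2D detection.
    if yolo_loss.item() < 0.1:
        yolo_loss = yolo_loss.detach() * 0.0
    return yolo_loss + k1 * l_center + k2 * l_dist + k3 * l_dim + k4 * l_orient
```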

5 Experimental results

5.1 Training details

5.1.1 Model size

Our method comes in different versions with a different size for each, as in YOLOv5. The depth and width of the network vary with the version, which results in different numbers of learnable parameters. Our L-CNN model thus comes in three versions: Large, Medium, and Small. To analyze the 3D detection accuracy and computation time of each version, we have trained and evaluated each of them on the KITTI and GTA datasets.

5.1.2 Split attention model

Among the convolutional blocks composing the network, the bottleneck introduced in [34] is used extensively in YOLOv5. Improvements to these convolutional blocks were recently introduced in [35] with Split Attention convolutions. These new Split Attention bottlenecks increase the performance of networks at the cost of a slight increase in computation time. To obtain a good compromise between accuracy and computation time, we have modified the Small model by replacing its bottlenecks with Split Attention bottlenecks. This new model, called Small SA (Split Attention), is dedicated to applications on systems with limited computational resources, such as the Jetson TX2 embedded board, without sacrificing prediction accuracy.

5.1.3 KITTI

We trained our method on the KITTI 3D detection dataset using the training and validation split defined in [36]. The training split contains 3712 images and the validation split 3769 images. To speed up the training process and improve accuracy, we trained our models with either pre-trained weights from YOLOv5 or pre-trained weights from our model after 15 epochs of training on our GTAV dataset. We demonstrate that pre-training on our dataset significantly improves the accuracy of our method on KITTI. Furthermore, as in [27], we performed training with both the left and right images of the KITTI training split, thus doubling the number of training images and improving the performance of our method during evaluation. To make our method compatible with embedded boards such as the Nvidia Jetson TX2, we performed training at a lowered resolution of \(672 \times 224\). We also trained at a resolution of \(1312 \times 416\), closer to the original resolution, to compare the accuracy of our method with state-of-the-art methods on the KITTI dataset.

5.1.4 GTA

Inspired by previous work on dataset creation [16] using images from the video game Grand Theft Auto V, we created our hybrid dataset of road and rail images. This new dataset allows us to overcome the absence of a railway dataset with a ground truth rich enough to enable 3D bounding box learning for cars, pedestrians, and motorcycles. Our method was trained on a split containing road and railway images (107 K images). The evaluation was then conducted on the validation split of the dataset containing 11,630 images. Training on this dataset was performed for 50 epochs.

5.1.5 Data augmentation

Data augmentation is very useful to improve the performance of CNN models through the construction of new and different training examples. Even though KITTI and our GTA-V dataset present a variety of environments and conditions (in the GTA-V dataset, for example, we have different situations such as road with clear weather, road at night, road with rainy weather, and rail with clear weather, as shown in Fig. 1), the risk of overfitting our models is still present. Data augmentation brings two advantages: (1) it reduces overfitting, and (2) it increases the generalization capability of the trained models, thus yielding better results in real-world environments. It is a good way to introduce variability into the training data, making it more representative of the real world.

We use Mosaic data augmentation as defined in YOLOv4 [37], which consists in mixing 4 training images to improve our method's understanding of the spatial context of objects (see the sketch below). We also apply translation and scaling transformations to the images as further augmentation.
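
A simplified sketch of the Mosaic idea is shown below: four training images are placed around a random pivot on a single canvas. Label remapping, clipping, and the scale jitter used in the full augmentation pipeline are omitted; the canvas size and fill value are assumptions for illustration.

```python
import random
import numpy as np

def mosaic(images, out_size=640):
    """images: list of 4 HxWx3 uint8 arrays; returns one mosaic image."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # gray fill
    xc = random.randint(out_size // 4, 3 * out_size // 4)           # random pivot
    yc = random.randint(out_size // 4, 3 * out_size // 4)
    regions = [(0, 0, xc, yc), (xc, 0, out_size, yc),
               (0, yc, xc, out_size), (xc, yc, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        h = min(y2 - y1, img.shape[0])          # crop each image to its quadrant
        w = min(x2 - x1, img.shape[1])
        canvas[y1:y1 + h, x1:x1 + w] = img[:h, :w]
    return canvas
```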

5.1.6 Hyperparameters

The hyperparameters of our models were determined after many iterations of short training sessions. We use the Adam optimizer with a cyclic learning rate scheduler as described in [38]. The maximum learning rate was set to \(9.4e-4\) with a final learning rate of \(1.8e-5\). Finally, we use gradient accumulation to train with an effective batch size of 64.
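
For illustration, one possible PyTorch realization of this setup is sketched below, using the one-cycle variant of a cyclic schedule as a stand-in for the scheduler of [38]. The physical batch size and all names are assumptions; only the learning rates and the effective batch size of 64 come from the text.

```python
import torch

def make_optimizer(model, steps_per_epoch, epochs, phys_batch=16):
    """phys_batch is a hypothetical per-GPU batch size; gradients are
    accumulated so that the effective batch size is 64."""
    accum = max(1, 64 // phys_batch)                 # gradient-accumulation steps
    max_lr, final_lr = 9.4e-4, 1.8e-5
    opt = torch.optim.Adam(model.parameters(), lr=max_lr)
    sched = torch.optim.lr_scheduler.OneCycleLR(
        opt, max_lr=max_lr,
        final_div_factor=(max_lr / 25.0) / final_lr,  # default div_factor = 25
        steps_per_epoch=max(1, steps_per_epoch // accum), epochs=epochs)
    return opt, sched, accum

# Inside the training loop (sketch):
#   loss = total_loss(...) / accum
#   loss.backward()
#   if (step + 1) % accum == 0:
#       opt.step(); opt.zero_grad(); sched.step()
```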

5.1.7 Loss weights

Since our method is trained end-to-end, our loss function combines the loss of each task. To obtain optimal results, it is necessary to find the weight \((k_1, ..., k_4)\) of each loss. Through multiple trials we found \(k_1= 0.005\), \(k_2 = 0.2\), \(k_3= 0.0176\), \(k_4 = 0.01\). To avoid overfitting the 2D detection, we imposed the condition \(L_{yolo} < 0.1\): when it is satisfied, \(L_{yolo}\) is ignored in the optimization loop.

5.2 Evaluation

In this subsection, we present the metrics that will be used in the evaluation of our models.

5.2.1 3D average precision

For the evaluation on the KITTI dataset, we used the official benchmark metrics, namely the 3D Average Precision (AP) with 40 recall positions as presented in [39]. The official KITTI benchmark uses an IoU threshold of 0.7 for the Car class and a 0.5 threshold for the Person and Cycle classes. We also added evaluation results for the Car class with a 0.5 threshold.
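
A simplified sketch of the 40-recall-position AP [39] is shown below; it assumes the precision/recall curve has already been built by matching detections to ground truth by 3D IoU (that matching step is not shown), and the function name is illustrative.

```python
import numpy as np

def ap_40(recall, precision):
    """recall, precision: 1-D arrays describing the cumulative P/R curve,
    computed from detections sorted by decreasing confidence."""
    ap = 0.0
    for r in np.linspace(1.0 / 40, 1.0, 40):        # 40 equally spaced recalls
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0  # interpolated precision
        ap += p / 40.0
    return ap
```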

5.2.2 2D detection

To evaluate the performance of the 2D detection, we used the metrics described by the authors of YOLOv5 on each class of the dataset. Given that the regression of the 3D bounding box parameters relies on accurate RoIs and classes for the objects, evaluating the precision of the 2D detector is a necessity. The error metrics are the Average Precision (AP), the Recall (R), and the mean Average Precision (\(mAP_{50}\)).

5.2.3 Distance estimation

For the evaluation of our distance estimator, we used the same metrics as those commonly used for image-level depth estimation. These metrics include the Absolute Relative Error (Abs Rel), the Squared Relative Error (SRE), the Root Mean Square Error (RMSE), the logarithmic RMSE (log RMSE), and the threshold accuracy. Letting \(z_{gt}\) and \(z_{pd}\) be, respectively, the ground truth and predicted distance of an object, the threshold accuracy is the percentage of objects satisfying Eq. (9), where \(\delta =1.25\):

$$\begin{aligned} \alpha _{k=\left[ 1..3 \right] }= max(\frac{z_{gt}}{z_{pd}},\frac{z_{pd}}{z_{gt}})<\delta ^k. \end{aligned}$$
(9)
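
For reference, a minimal NumPy sketch of these object-level distance metrics is given below; the function and variable names are illustrative, and the averaging over objects follows the standard definitions of these metrics.

```python
import numpy as np

def distance_metrics(z_gt, z_pd):
    """z_gt, z_pd: arrays of ground-truth and predicted object distances (m)."""
    z_gt, z_pd = np.asarray(z_gt, float), np.asarray(z_pd, float)
    abs_rel  = np.mean(np.abs(z_gt - z_pd) / z_gt)            # Abs Rel
    sq_rel   = np.mean((z_gt - z_pd) ** 2 / z_gt)             # SRE
    rmse     = np.sqrt(np.mean((z_gt - z_pd) ** 2))           # RMSE
    log_rmse = np.sqrt(np.mean((np.log(z_gt) - np.log(z_pd)) ** 2))  # log RMSE
    ratio    = np.maximum(z_gt / z_pd, z_pd / z_gt)
    deltas   = [np.mean(ratio < 1.25 ** k) for k in (1, 2, 3)]       # Eq. (9)
    return abs_rel, sq_rel, rmse, log_rmse, deltas
```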

5.2.4 Dimensions

The dimension prediction evaluation is performed using the Dimension Score (DS) described by the authors of [23]. With \(V_{pd}\) and \(V_{gt}\) the predicted and ground truth volumes of the object, the DS is computed using Eq. (10) and averaged across all examples:

$$\begin{aligned} DS=min(\frac{V_{pd}}{V_{gt}},\frac{V_{gt}}{V_{pd}}). \end{aligned}$$
(10)

5.2.5 Object center

The object center predictions were evaluated with the Center Score (CS) as described in [23]. Assuming x and y the projected center coordinates in pixels and w and h the width and the height of the 2D bounding box, CS is computed with Eq. (11):

$$\begin{aligned} CS=(2+cos(\frac{x_{gt}-x_{pd}}{w_{pd}})+cos(\frac{y_{gt}-y_{pd}}{h_{pd}}))/4. \end{aligned}$$
(11)

5.2.6 Orientation

To evaluate the orientation predictions, we use the Orientation Score (OS) averaged across all examples. OS is calculated using Equation (12):

$$\begin{aligned} OS=(1+cos(\alpha _{gt}-\alpha _{pd}))/2. \end{aligned}$$
(12)
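
For reference, direct sketches of the Dimension, Center, and Orientation Scores of Eqs. (10)-(12) are given below; each score is averaged across all examples in the evaluation code, and the function names are illustrative.

```python
import numpy as np

def dimension_score(v_pd, v_gt):                      # Eq. (10)
    return min(v_pd / v_gt, v_gt / v_pd)

def center_score(x_gt, y_gt, x_pd, y_pd, w_pd, h_pd):  # Eq. (11)
    return (2 + np.cos((x_gt - x_pd) / w_pd)
              + np.cos((y_gt - y_pd) / h_pd)) / 4

def orientation_score(alpha_gt, alpha_pd):            # Eq. (12)
    return (1 + np.cos(alpha_gt - alpha_pd)) / 2
```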

5.2.7 Computation time

To evaluate the computation time of our models, we compute the average time necessary for a forward pass (including Non-Maximum-Suppression) on a single image during inference. We tested all the presented methods on an Nvidia RTX 3080 GPU. We also measured computation time on an embedded Nvidia Jetson TX2 board, as shown in Table 4.
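
A minimal sketch of such a timing measurement is given below; it assumes a CUDA device, that the model's forward call includes the NMS filtering step as in our protocol, and that a few warm-up iterations are excluded from the average. All names are illustrative.

```python
import time
import torch

@torch.no_grad()
def average_inference_time(model, images, n_warmup=10):
    """images: list of preprocessed input tensors already on the GPU."""
    model.eval()
    for img in images[:n_warmup]:            # warm-up iterations (not timed)
        model(img)
    torch.cuda.synchronize()                 # CUDA is asynchronous: sync clocks
    start = time.perf_counter()
    for img in images[n_warmup:]:
        model(img)                           # forward pass, NMS assumed inside
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / max(1, len(images) - n_warmup)
```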

Table 1 Quantitative results for the Car class from our evaluation on the KITTI dataset. We can see that our method benefits from being pre-trained on our GTA dataset and manages to outperform other state-of-the-art methods, especially for moderate and hard targets. The Car class on the KITTI dataset represents \(\sim 82\%\) of all the dataset objects
Table 2 Quantitative results for the Person class from our evaluation on the KITTI dataset. We can see that our method benefits from being pre-trained on our GTA dataset. The Person class on the KITTI dataset represents \(\sim 13\%\) of all the dataset objects.
Table 3 Quantitative results for the Cycle class from our evaluation on the KITTI dataset. We can see that our method benefits from being pre-trained on our GTA dataset. The Cycle class on the KITTI dataset represents \(\sim 5\%\) of all the dataset objects
Table 4 Computing time on the Nvidia Jetson TX2 embedded board. Our results show that our new Small SA model is the most appropriate for real-time embedded applications with 60ms/img (17 fps)

5.3 Result analysis

We present the quantitative results of our models on the KITTI dataset using the official 3D AP metric in Table 1 for the Car class, Table 2 for the Person class, and Table 3 for the Cycle class, alongside state-of-the-art methods for comparison. These results show that our approach, although less complex than state-of-the-art methods, reaches comparable or even superior accuracy. Our model is also significantly faster than all other tested approaches, with a computation time of 11.2 ms per image for our Large model when inferring on an RTX 3080 GPU. Our method achieves even higher accuracy when using weights pre-trained on our GTAV dataset, which shows the potential of video game datasets to improve 3D object detection methods on real datasets. Thanks to this, our Large model outperforms state-of-the-art methods in moderate and hard scenarios. Finally, we also conducted experiments on a Jetson TX2 embedded board to confirm the applicability of our smallest model for real-time embedded tasks. We present the results in Table 4.

The results of our new Small SA model show that, when training at full resolution (\(1312\times 416\)), it performs the same as the Small model, or worse when not using pre-trained weights. However, at the smaller resolution (\(672\times 224\)), it significantly outperforms the Small model. It is a particularly interesting model for applications on systems with limited computational capabilities, since its memory requirements are lower and it runs at 3.1 ms/img while having greater accuracy than the Small and Medium models trained at the same resolution. Our tests on the Jetson TX2 also show that it is the fastest of our methods on this platform. The quantitative results of our models on both datasets for all classes, along with the results of our baseline based on our previous multi-stage approach, are presented in Table 5 and in Fig. 5. We can see that we achieved significant accuracy gains by switching from the multi-stage approach based on YOLOv3 [40] of our previous work [33] to an anchor-based approach using YOLOv5. We conducted the experiments using either the RoIs from the YOLOv3 predictions or the ground truth for the feature alignment necessary for the regression of the 3D bounding box parameters, in order to highlight the problem posed by RoI variation in the 3D parameter regression of multi-stage approaches. We show that our new single-stage approach significantly outperforms our previous method in both cases, especially for depth estimation and orientation.

Fig. 5
figure 5

Depth RMSE error obtained by our different models along with the baseline multi-stage method [33] for each class on the KITTI and GTA datasets. Our new method improves accuracy on the KITTI dataset significantly when we use weights pre-trained on our new GTA dataset. With these weights, our new model with Split Attention (SA) convolutions is able to reach the same performance as our medium-sized model

Table 5 Quantitative results from our evaluation on both the KITTI and our GTA datasets. Results from our previous method are displayed along with those of our new method. Our new method significantly outperforms our previous approach, especially in terms of depth estimation accuracy

We also compare the various models of our current approach (Small, Small SA, Medium, Large): the precision of 2D detection, dimensions, center, and orientation does not significantly benefit from the heavier models, while distance estimation improves when using larger models. Qualitative results on both the KITTI and GTA datasets are presented in Fig. 6.

Fig. 6
figure 6

Qualitative results of our methods on both the KITTI and GTA datasets. These images are extracted from the validation split of each dataset. We display the results of both our best model (Large, pre-trained on GTAV) and our new lightweight model for embedded platforms (Small SA, pre-trained on GTAV). Top four rows: results obtained on the GTAV dataset (left column: ground truth, middle column: Small SA prediction, right column: Large model prediction); bottom four rows: results obtained on the KITTI dataset (left column: ground truth, middle column: Small SA prediction, right column: Large model prediction)

6 Conclusion and outlook

In this paper, we presented our new anchor-based 3D object detection method from single-image input, using a modified YOLOv5 network. Along with this method, we also presented our new virtual dataset based on the video game GTAV for both road and rail environments, filling the gap in railway datasets for 3D object detection and localization. We showed that state-of-the-art precision on the KITTI benchmark can be reached by adopting transfer learning from models trained on our new dataset, while remaining the fastest available method for 3D object detection from images. Our approach is particularly well suited for embedded use, as demonstrated by our tests on the Nvidia Jetson TX2 embedded board. We also provided an extensive evaluation of our approach, along with our previous two-stage network, on both KITTI and our own GTAV road/railway virtual dataset. These results showed that our method is suitable for railway applications in a virtual environment, yet we still need to prove the applicability of our lightweight network under real conditions. Our fastest model (Small SA) also shows a significant gap in precision with respect to our largest model (Large). To address these issues, we first plan to acquire a railway dataset with real-world images (a multimodal dataset provided by 2 cameras, 1 LiDAR, 1 radar, and 1 GPS/IMU) to further validate our approach. To improve the precision of our smallest model (Small SA), we aim to use knowledge distillation, thereby reducing the precision gap without increasing the computation time. We also aim to develop a lightweight 3D tracking algorithm and integrate it into our method to provide a complete solution for road and rail safety, and to further improve detection accuracy using Multi-Task Learning (MTL).