1 Introduction

An intelligent transportation system (ITS) is the operational combination of a set of technologies that, when integrated and managed in the right way, improves the performance and capabilities of the entire system. An ITS improves the transport system of a city or an intercity highway, making it safer and more efficient [16].

In intelligent transportation systems, as in traditional systems, the decisions taken to improve mobility on public roads are based on traffic variables such as traffic flow and average speed, which are at the same time the objectives of the optimization. These decisions include changing the green times of traffic lights or granting right-of-way to lanes in specific directions; alternatively, the measured variables can be used to assess the performance of transit policies by analyzing their values before and after implementation.

In some countries, vehicle counting is still performed manually to calculate urban flows. This process is carried out by staff located at intersections on the main roads of the city, who draw strokes on vehicular-capacity forms, where each stroke represents a car passing that location [2]. This method has an unknown level of error, which increases the uncertainty of the results. To improve the reliability of the data, redundant measurements are taken: several people perform the same count at the same location and their results are averaged. This still does not establish a known level of error; on the contrary, it increases the cost of the measurement.

This work presents a method for the detection, tracking, counting and classification of urban traffic through digital image processing and convolutional neural networks, which aims to overcome the disadvantages of the manual process. Section 2 reviews the state of the art of the three main stages that make up the proposed method: detection, tracking and classification. Section 3 explains the techniques used in each stage of the algorithm. Section 4 presents the results obtained from tests carried out at different locations in the city of Bogotá. Conclusions are presented in Sect. 5.

2 Related Work

Typically, the process of detecting and counting vehicles with artificial vision is divided into three main stages: detection, which establishes the location of the vehicles present in each frame of a video; tracking, which calculates the route followed by each vehicle in the scene; and classification, which identifies different types of vehicles. A large number of methods are used to perform these tasks; some of them, together with several previous works, are presented below.

2.1 Detection

A widely used method for object detection is background subtraction, which is based on modelling the background of a sequence of images. The segments corresponding to moving objects are calculated as the difference between each frame and the previously calculated model, which is then updated with that same frame. In [3], vehicle detection is performed with this method, modelling the background with a Gaussian Mixture Model (GMM). This technique models each background pixel as a mixture of Gaussian distributions, so pixels that do not fit those distributions are considered moving objects [15].
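As an illustration of the idea, the sketch below models each pixel with a single running Gaussian, a deliberate simplification of the GMM of [15]; the function names, learning rate and threshold are our own assumptions, not values from the cited works:

```python
import numpy as np

def update_background(frame, mean, var, alpha=0.05, k=2.5):
    """Single-Gaussian-per-pixel background model (a simplified
    stand-in for the Gaussian Mixture Model described above).
    Returns the foreground mask and the updated model."""
    # Pixels farther than k standard deviations from the mean
    # are labelled as moving objects (foreground).
    dist = np.abs(frame - mean)
    foreground = dist > k * np.sqrt(var)
    # Only background pixels update the running mean and variance,
    # so moving objects do not corrupt the model.
    bg = ~foreground
    mean = np.where(bg, (1 - alpha) * mean + alpha * frame, mean)
    var = np.where(bg, (1 - alpha) * var + alpha * dist**2, var)
    return foreground, mean, var

# Toy example: a static 4x4 background with one "moving" pixel.
mean = np.full((4, 4), 100.0)
var = np.full((4, 4), 4.0)
frame = mean.copy()
frame[1, 2] = 200.0  # a bright moving object
fg, mean, var = update_background(frame, mean, var)
print(int(fg.sum()))  # number of foreground pixels -> 1
```

In a GMM each pixel would keep several such Gaussians with mixing weights, which handles multi-modal backgrounds (e.g. swaying trees) that a single Gaussian cannot.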

In [19], the potential locations of the vehicles in the image are first hypothesized by locating the shadows of the objects in the scene, and the detections are then corroborated through symmetry and edge analysis. A similar approach that adds texture analysis of the image is presented in [9]. Another approach is to search for characteristic components of the element to be detected; for instance, in [11] these components are searched for as basic shapes, such as circles, in order to find the tires of a bicycle. In the case of vehicles, a method based on detecting the back of the car, specifically the rear lights, was presented in [13]. The detection method used in this work is based on a cascade classifier previously trained with the algorithm presented by Viola and Jones in [17]. Some works done with this method are presented in [14, 18, 20].

2.2 Tracking

Tracking consists of tracing the path of a moving object as it changes its location in a scene. Tracking methods are mainly divided into three types: region-based, in which the displacement of the segment corresponding to the moving object is calculated; active-contour-based, which calculates and updates a contour enclosing the detected object; and feature-based, which traces and follows specific characteristics of the object [12].

In addition, a method must be established to estimate future positions of the object, so that tracking can continue even when the detection stage does not deliver a position in a given frame. The Kalman filter is an estimator that uses measurements observed in the past to predict future values of a variable; [8] presents satisfactory results in the tracking of multiple moving objects. The particle filter can be used for the same purpose and allows simplifying the traditional methods used with the Kalman filter [1].
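The predict/correct cycle can be sketched as follows for one coordinate of a tracked vehicle; the constant-velocity model and the noise matrices are illustrative assumptions, not the tuning used in any of the cited works:

```python
import numpy as np

# Minimal constant-velocity Kalman filter for one coordinate of a
# vehicle (the 2-D case simply doubles the state).
F = np.array([[1.0, 1.0], [0.0, 1.0]])  # state transition (pos, vel)
H = np.array([[1.0, 0.0]])              # only position is measured
Q = np.eye(2) * 1e-3                    # process noise (assumed)
R = np.array([[1.0]])                   # measurement noise (assumed)

x = np.array([[0.0], [0.0]])            # initial state
P = np.eye(2)                           # initial covariance

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def correct(x, P, z):
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)      # Kalman gain
    x = x + K @ (np.array([[z]]) - H @ x)
    P = (np.eye(2) - K @ H) @ P
    return x, P

# A vehicle moving ~2 px/frame; frame 4 has no detection (None), so
# the filter's prediction stands in for the missing measurement.
for z in [2.0, 4.0, 6.0, None, 10.0]:
    x, P = predict(x, P)
    if z is not None:
        x, P = correct(x, P, z)
print(float(x[0, 0]))  # estimated position, close to 10
```

The key property used by the tracker is that predict() alone keeps the track alive through frames where the detector returns nothing.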

2.3 Classification

The classification of vehicles is performed with machine learning algorithms, which are trained on large datasets of different types of cars after extracting different features, such as dimensions (length, width, area), edge orientation histograms, Haar features, and color, among others. These features are fed to a classifier such as a neural network, a support vector machine, or a boosting classifier, among others. Ojha and Sakhare present in [10] a summary of works done with different types of classifiers and features.

Sometimes, choosing the correct type of feature is challenging, takes a long time, and does not always give good results. Convolutional Neural Networks (CNNs) handle this problem with convolutional layers that extract features through spatial filters, whose weights are learned in the same way as the other network parameters. [4, 5] present works where CNNs are used for the classification of vehicles.
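The core operation of such a layer can be illustrated with a plain 2-D convolution; here the filter weights are fixed by hand to a vertical-edge kernel for the sake of the example, whereas in a CNN they would be learned by backpropagation:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' 2-D convolution (no padding, stride 1), the core
    operation of a convolutional layer."""
    kh, kw = kernel.shape
    ih, iw = image.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

# A hand-written vertical-edge kernel; a CNN learns such weights.
kernel = np.array([[1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0],
                   [1.0, 0.0, -1.0]])

# Image: dark left half, bright right half -> strong (negative)
# responses where the window straddles the edge.
image = np.zeros((5, 6))
image[:, 3:] = 1.0
response = conv2d_valid(image, kernel)
print(response)
```

A convolutional layer stacks many such filters and learns their weights, so the network discovers which features (edges, corners, textures) are useful for distinguishing vehicle classes.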

3 Proposed Method

The proposed procedure for the detection, tracking, counting and classification of vehicles is summarized in Fig. 1. This procedure is performed independently for cars and motorcycles.

Fig. 1. Proposed method.

Viola-Jones Algorithm. The first stage, detection, is based on the Viola-Jones algorithm [17], which uses a cascade classifier built on Haar features such as those shown in Fig. 2, where the feature value is the difference between the sums of the pixels under the black and white regions.

Fig. 2. Example of Haar features.

The set of features is preselected through an AdaBoost learning algorithm. Detection is carried out by sweeping a detection window across the image and processing it through the stages of the classifier, as shown in Fig. 3, establishing from the classifier's result whether the object is present at a given position. This algorithm offers high detection rates with very short processing times.

Fig. 3. Cascade classifiers.
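The feature computation described above can be sketched with an integral image (summed-area table), which lets any rectangle sum be evaluated with four lookups; the two-rectangle feature below is a simplified example of the mechanism, not the exact feature set of [17]:

```python
import numpy as np

def integral_image(img):
    """Summed-area table: cumulative sums over rows and columns."""
    return img.cumsum(axis=0).cumsum(axis=1)

def region_sum(ii, top, left, h, w):
    """Sum of pixels in a rectangle using 4 lookups on the integral
    image (constant time regardless of rectangle size)."""
    a = ii[top + h - 1, left + w - 1]
    b = ii[top - 1, left + w - 1] if top > 0 else 0
    c = ii[top + h - 1, left - 1] if left > 0 else 0
    d = ii[top - 1, left - 1] if top > 0 and left > 0 else 0
    return a - b - c + d

def two_rect_haar(img, top, left, h, w):
    """A two-rectangle Haar feature: sum under the left half minus
    sum under the right half, as in Fig. 2."""
    ii = integral_image(img)
    white = region_sum(ii, top, left, h, w // 2)
    black = region_sum(ii, top, left + w // 2, h, w // 2)
    return white - black

# Toy image: bright left half, dark right half.
img = np.zeros((4, 4))
img[:, :2] = 1.0
print(two_rect_haar(img, 0, 0, 4, 4))  # strong positive response
```

The constant-time rectangle sums are what make it feasible to sweep the detection window over every position and scale of a frame in real time.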

Kalman Filter. Each detected position is fed to a Kalman filter in order to track the vehicle and estimate its location in the frames where the detector does not deliver one. A vehicle is counted when its position leaves a pre-established region of interest.
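A minimal version of this counting rule, with a hypothetical region of interest and hypothetical track data, could look like this:

```python
# A track is counted the first time its (tracked) position leaves the
# region of interest. ROI bounds and track data below are made up.
ROI = (0, 0, 640, 400)  # x_min, y_min, x_max, y_max

def inside(pos, roi=ROI):
    x, y = pos
    return roi[0] <= x < roi[2] and roi[1] <= y < roi[3]

def count_exits(tracks):
    """tracks: dict track_id -> list of (x, y) positions per frame."""
    count = 0
    for positions in tracks.values():
        was_inside = False
        for pos in positions:
            if inside(pos):
                was_inside = True
            elif was_inside:
                count += 1   # vehicle left the ROI: count it once
                break
    return count

tracks = {
    1: [(100, 50), (110, 200), (115, 420)],  # crosses ROI and exits
    2: [(300, 10), (305, 100), (310, 300)],  # still inside at the end
}
print(count_exits(tracks))  # 1
```

Counting on exit rather than on first detection avoids double-counting a vehicle whose detection flickers on and off across frames.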

Classification. Classification is only performed for cars. A color classifier is first applied to determine whether a detection corresponds to a taxi (taxis are yellow in Colombia); if it does not, the detection (a snapshot of the detected car) is fed to the convolutional neural network, which classifies it as bus, microbus, minivan, sedan, SUV or truck.
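The internals of the color classifier are not detailed here; one plausible sketch is a "yellow pixel fraction" test on the snapshot, where the channel thresholds and the decision fraction below are assumptions for illustration only, not the trained classifier of this work:

```python
import numpy as np

def yellow_fraction(rgb):
    """Fraction of pixels that look 'taxi yellow': high red and green,
    low blue. Thresholds are illustrative assumptions."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    yellow = (r > 150) & (g > 120) & (b < 100)
    return yellow.mean()

def is_taxi(rgb, min_fraction=0.4):
    """Decide 'taxi' if enough of the snapshot is yellow."""
    return yellow_fraction(rgb) >= min_fraction

# Synthetic snapshots: a mostly-yellow patch vs. a gray one.
taxi = np.full((10, 10, 3), [230, 190, 40], dtype=np.uint8)
gray = np.full((10, 10, 3), [120, 120, 120], dtype=np.uint8)
print(is_taxi(taxi), is_taxi(gray))  # True False
```

Running the cheap color test first means the comparatively expensive CNN is only invoked for non-taxi detections.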

For the purposes of this work, a variation of the AlexNet convolutional neural network [6] was used, pretrained on the ImageNet dataset [7], which contains millions of images from 1000 categories. The network architecture is shown in Fig. 4.

Fig. 4. Variation of the AlexNet Convolutional Neural Network [6].

4 Experiments and Results

The following subsections present the training procedures and the results obtained for each stage of the algorithm, as well as the overall performance.

4.1 Cascade Classifier Training

The training of the cascade classifier was performed with an image bank of 12,500 positive samples (vehicles) and 14,000 negative samples (houses, buildings, people, animals, empty streets, etc.). Figure 5 presents some examples of positive samples.

Fig. 5. Examples of vehicle dataset.

In order to find appropriate training parameters, sweeps were performed over the following parameters. The chosen values gave a detection rate of 74.9% and a false-positive rate of 1.4%.

  • Number of stages: from 15 to 30. Chosen: 20

  • Types of features: HAAR, HOG and LBP. Chosen: HAAR

  • Detection window size: 12 \(\times \) 24, 18 \(\times \) 18, 18 \(\times \) 24, 18 \(\times \) 36, 24 \(\times \) 24. Chosen: 18 \(\times \) 24

  • Type of boosting: DAB, RAB, LB, GAB. Chosen: GAB.
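Assuming OpenCV's opencv_traincascade tool was used for this training, the chosen parameters would map to a command along these lines; the file names and per-stage sample counts below are placeholders, not values from the experiment:

```shell
# Hypothetical training invocation; pos.vec and negatives.txt stand in
# for the prepared positive-sample vector and negative-image list.
opencv_traincascade -data classifier_out \
    -vec pos.vec -bg negatives.txt \
    -numPos 11000 -numNeg 14000 \
    -numStages 20 -featureType HAAR \
    -w 18 -h 24 \
    -bt GAB
```

Here -numStages, -featureType, the 18 \(\times \) 24 window (-w, -h) and Gentle AdaBoost (-bt GAB) correspond to the chosen sweep values listed above.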

4.2 Classification

The training of the color classifier was done with an image bank of 295 positive samples (taxis) and 713 negative samples (non-taxis), obtaining an accuracy of \(99.3\%\) on the test dataset.

According to Fig. 4, the input image of the CNN classifier must be \(224\times 224\), and the original output size is 1000. However, the last fully connected layer was changed to have 6 outputs, and fine-tuning was performed with the BIT dataset [4] (558 buses, 883 microbuses, 476 minivans, 5922 sedans, 1392 SUVs and 822 trucks). 80% of the dataset was used for training and validation and the remaining 20% for testing, yielding the learning curve presented in Fig. 6 and the confusion matrix of Table 1. It can be observed that at around 300 iterations the training converged, with a loss around 0.1 and a training precision around 97%. The testing accuracy was 88% on average over all classes.
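For reference, per-class accuracy of this kind is read off a confusion matrix as the diagonal divided by the row sums; the 3-class matrix below is a made-up example to show the computation, not the values of Table 1:

```python
import numpy as np

# Confusion matrix: rows are true classes, columns are predictions.
# These numbers are illustrative only.
cm = np.array([[90,  5,  5],
               [10, 80, 10],
               [ 4,  6, 90]])

per_class = cm.diagonal() / cm.sum(axis=1)  # per-class recall
average = per_class.mean()                  # unweighted class average
print(per_class, float(average))
```

Averaging per-class accuracies (rather than pooling all samples) prevents a dominant class, such as sedans in the BIT dataset, from masking poor performance on rare classes.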

Fig. 6. Learning curve of the CNN fine-tuning.

Table 1. Confusion matrix got from the CNN testing.

4.3 Performance of the Complete Method

In order to evaluate the performance of the complete method, traffic videos were captured at different locations in the city of Bogotá. The total number of vehicles, the number of taxis among them, and the number of motorcycles crossing the captured road area were counted manually. Subsequently, the videos were processed by the proposed method. Table 2 shows the results obtained for each of the videos, at 640 \(\times \) 480 resolution, with respect to the manual count. Time and site conditions were varied in order to show the behavior of the method in different situations. These conditions are:

  • Video 1- Main 4 lane road. Hour: 15:20. Weather: cloudy. Average speed: 27 km/h.

  • Video 2- Main 4 lane road. Hour: 10:30. Weather: cloudy. Average speed: 12 km/h.

  • Video 3- Main 4 lane road. Hour: 14:12. Weather: sunny. Average speed: 17 km/h.

  • Video 4- Main 2 lane road. Hour: 11:00. Weather: cloudy. Average speed: 22 km/h.

  • Video 5- Main 2 lane road. Hour: 15:00. Weather: cloudy. Average speed: 24 km/h.

Table 2. Performance of the complete method with the 640 \(\times \) 480 video.

The system was implemented on a Cubieboard 4 with a surveillance camera of 640 \(\times \) 480 pixels of resolution. Detection and tracking processing times of around 11.11 ms were obtained, appropriate for a real-time application. In addition, classification times of around 76 ms were obtained, which allowed implementing classification with a FIFO approach in an independent thread.
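This FIFO decoupling can be sketched with a standard producer-consumer queue; classify() below is a placeholder stand-in for the color classifier and the CNN, not the actual implementation:

```python
import queue
import threading

# Decouple slow classification (~76 ms) from the fast detection and
# tracking loop (~11 ms) with a FIFO queue and a worker thread.
snapshots = queue.Queue()
results = []

def classify(snapshot):
    return f"class-of-{snapshot}"   # placeholder for color test + CNN

def worker():
    while True:
        item = snapshots.get()
        if item is None:            # sentinel value stops the worker
            break
        results.append(classify(item))
        snapshots.task_done()

t = threading.Thread(target=worker)
t.start()

# The detection/tracking loop only enqueues snapshots; it never
# blocks on the slower classifier.
for vehicle_id in range(3):
    snapshots.put(vehicle_id)
snapshots.put(None)
t.join()
print(results)
```

Because counting happens in the tracking stage, a classification backlog in the queue delays only the class labels, never the vehicle counts.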

5 Conclusions

A method for measuring urban traffic through digital image processing, divided into three main stages, was proposed, implemented and verified. Detection was based on a cascade classifier with Haar features, obtaining a detection rate of 74.9% and a false-positive rate of 1.4%. The tracking and counting stage, implemented with a Kalman filter, raises the effective detection rate, obtaining an average error of 2.0% in the total vehicle count and 5.5% in the motorcycle count. The convolutional neural network presented an average precision of 88.0% in the tests performed. Additionally, the execution times allow the implementation of the system on a commercial embedded platform.

In general, it can be concluded that the proposed method presents appropriate results without the need for high image resolution, allowing its execution on platforms that do not have high computational capacity and opening the possibility of implementing low-cost traffic measurement systems in which the analysis is performed locally.