Transfer Learning Using Deep Neural Networks for Classification of Truck Body Types Based on Side-Fire Lidar Data

  • Reza Vatani Nezafat
  • Olcay Sahin
  • Mecit Cetin
Original Paper


Vehicle classification is one of the most essential aspects of highway performance monitoring, as vehicle classes are needed for various applications including freight planning and pavement design. While most existing systems use in-pavement sensors to detect vehicle axles and lengths for classification, researchers have also explored traditional image-based approaches, which tend to be computationally expensive and typically require a large amount of data for model training. As an alternative to these image-based methods, this paper investigates whether it is possible to transfer the learning (or parameters) of a highly accurate pre-trained deep neural network model to classify truck images generated from 3D point cloud data from a LiDAR sensor. In other words, without changing the parameters of several well-known convolutional neural networks (CNNs), such as AlexNet, VGGNet, and ResNet, this paper shows how they can be adopted to extract the features needed to classify trucks, in particular trucks with different types of trailers. The applicability of these CNNs to the vehicle classification problem is demonstrated through an extensive set of experiments conducted on images created from LiDAR data. Results show that using pre-trained CNN models to extract low-level features from images yields accurate results (around 97% classification accuracy), even with the relatively small amount of training data needed for the final classification step.


Transfer learning · LiDAR · Convolutional neural network · Freight monitoring

Introduction and Background

The rapid growth of technology over the last 50 years has been accompanied by an unprecedented increase in the number of vehicles on the road. As a result, the need for innovative solutions to improve the performance of transportation systems has never been greater. Nearly all traffic management approaches include a monitoring mechanism. The purpose of this mechanism can be the estimation of macroscopic/microscopic traffic parameters or the classification of vehicle types, which are useful in many applications such as freight planning, highway design and maintenance, traffic operations, transportation planning, and tolling. Commonly used vehicle detection technologies such as inductive magnetic loop detectors (Mita and Imazu 1995), magnetic sensors (He et al. 2012), acoustic sensors (Wang et al. 2014), infrared sensors (Tropartz et al. 1999), and weigh-in-motion sensors (Nichols and Cetin 2007) are already used in practice to count the number of axles or capture other physical features needed by classification algorithms.

In addition to the vehicle detection technologies listed above, surveillance cameras are another common option for monitoring traffic flow, since they have a relatively low maintenance cost and do not require traffic disruption during installation. Because of the widespread use of cameras, researchers within the transportation community have tried various image/video-based vehicle classification approaches (Hsieh et al. 2006). Early studies used heuristic strategies to perform the classification task with low-level features (Gupte et al. 2002); their approach involves three levels: raw images, region level, and vehicle level. Later, researchers applied machine learning algorithms, such as Support Vector Machines (SVM), to vehicle classification. More recently, researchers have classified vehicle images into four categories (motorcycle, car, lorry, and background) using histograms of oriented gradients to train an SVM with nonlinear kernel functions (Ng et al. 2014). Others have tried an active learning approach to classify vehicles under different conditions such as traffic level, road illumination, and weather (Sivaraman and Trivedi 2010). Statistical models such as a hybrid dynamic Bayesian network or a Gaussian mixture model have also been investigated. These models can reach high accuracy using low-level features, but they are highly dependent on the selected features (Chen et al. 2012; Kafai and Bhanu 2012). Although image/video-based models are accurate, in practice they face performance limitations in low illumination and bad weather. Hence, it is important to investigate alternative technologies to improve the performance of these monitoring systems, such as 3D scanners, which are already popular tools for pavement studies in transportation (Chang et al. 2005; Mahmoudzadeh et al. 2015).

In recent years, light detection and ranging (LiDAR) technology has been studied in many transportation applications, such as Advanced Driving Assistance Systems (ADAS) to improve safety (Khattak et al. 2003; Tsai et al. 2011; Veneziano et al. 2003), roadway design (Souleyrette et al. 2003), and tracking individual vehicles in the traffic stream (Sazara et al. 2017). LiDAR systems have proliferated within the last several years as a key (distance/range) measuring technology for automated vehicles (Choi et al. 2013; Zhang 2010). In most of these studies, a LiDAR mounted on the vehicle captures data from the surrounding environment. Some researchers have used aerial LiDAR platforms for monitoring traffic (He et al. 2017; Yao et al. 2008). In addition, there are studies that use a static LiDAR to detect vehicles passing a specific point on the roadway. For instance, Lee and Coifman (2012) mounted two LiDARs on a vehicle with a 40 ft gap between them. They parked the vehicle on the roadside, first extracted the background, and then detected the vehicles in the traffic stream with a segmentation algorithm. Using geometric features, they developed a heuristic-based vehicle classification algorithm. Another study installed one LiDAR on an overhead bridge to detect and count motorcycles (Prabhakar et al. 2013). More recently, a study used a Velodyne VLP-16 LiDAR mounted on a traffic signal pole to detect vehicles within the intersection (Aijazi et al. 2016).

This paper focuses on the feasibility of implementing deep neural networks for the classification of vehicles using side-fire LiDAR data. Rather than solving the typical classification problem of categorizing vehicles (e.g., into cars, small trucks, large trucks, etc.), this study addresses the detection of truck body types. More specifically, the main objective is to distinguish between four types of trailers/trucks: a truck carrying an intermodal container versus a dry or enclosed van, each with or without an attached refrigerator unit. In most urban areas in the USA, especially those with intermodal ports, these body types constitute a large percentage of all FHWA Class 9 trucks. Compared to other body types, such as tanks, dump trailers, and auto transporters, the four selected body types (i.e., enclosed van, refrigerated enclosed van, intermodal container, and refrigerated intermodal container) are more challenging to classify because of the high degree of similarity in their shapes and sizes. Being able to classify truck body types is important for freight planning and commodity flow modeling, since body configuration can be linked to the type of commodity being hauled. Because of this, there has been recent interest in classifying truck body types into distinct subclasses based on data from inductive loop signatures and weigh-in-motion sensors (Hernandez et al. 2016).

The Artificial Neural Network (ANN) idea has been around since 1960 (Lippmann 1987). Many studies have used ANNs as a machine learning approach in all engineering fields, including transportation (Faghri and Hua 1992). The advantage of ANNs over classical machine learning approaches such as SVM is their ability to perform feature extraction and selection automatically. To get the best performance out of classical methods, researchers need to select the best features to represent the data, which is time-consuming and involves some heuristic procedures. In a sense, for classical methods, part of the learning has to be done by the researcher. Until a few years ago, the performance of ANNs and classical methods was almost the same. With recent advancements in computational power and the growing accumulation of data, researchers have noticed an interesting pattern in the performance of machine learning algorithms: the performance of ANNs increases rapidly with more data, while the performance of classical methods stops improving after a certain point. This observation has led to an increase in new studies of ANNs in the field of computer science. For some tasks, such as image retrieval (Ku et al. 2015) and object detection and tracking (Girshick 2015; Ren et al. 2015), ANNs have reached state-of-the-art performance. Image classification with Convolutional Neural Networks (CNNs) is one of the most successful methods in neural network research; it can find the properties of different categories much more accurately than other methods. The drawback of CNNs is the need for a tremendous amount of training data and computation power, because the model has to optimize numerous parameters in its network structure (Russakovsky et al. 2015). Recently, this method has become popular among transportation researchers. One study was able to detect cracks in hot-mix asphalt and Portland cement concrete from pavement images with the help of a deep CNN (Gopalakrishnan et al. 2017). Other researchers have used CNNs for vehicle detection based on satellite images (Chen et al. 2014) and for vehicle classification based on video from surveillance cameras (Adu-Gyamfi et al. 2017).

The training process of deep CNNs is time-consuming and needs a huge amount of computational power. Moreover, it can easily lead to overfitting. Some researchers have tried a pre-training and fine-tuning strategy to overcome this limitation (Zhuo et al. 2017). They pre-trained a GoogLeNet model (Szegedy et al. 2015) on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 dataset to obtain the initial model. Then, using 13,700 images extracted from surveillance cameras, they fine-tuned the initial model. This approach reached around 98% accuracy in the classification of vehicles. This strategy addresses the overfitting problem, but the pre-training procedure is still computationally intensive. By visualizing the learned features of CNNs (Zeiler and Fergus 2014), researchers have noticed that the network learns low-level features in the first layers and that the features become progressively more complex deeper in the model. Regardless of the dataset, the low-level features are nearly the same for almost any type of image. It is therefore intuitive to keep the low-level features learned from one dataset and transfer that knowledge to perform classification on a different dataset. This method uses the feature descriptor part of an existing trained model such as AlexNet (Krizhevsky et al. 2012) and replaces the classifier part with a new task-specific model. This modeling practice is called "transfer learning." Many researchers have used transfer learning to improve the accuracy and proficiency of new models with limited training data (Donahue et al. 2014; Hu et al. 2015; Pan and Yang 2010; Razavian et al. 2014).

Many CNN models trained on the ImageNet dataset (Russakovsky et al. 2015) can be used as pre-trained CNNs, including AlexNet (Krizhevsky et al. 2012), VGGNet (Simonyan and Zisserman 2014), and ResNet (He et al. 2016). In this paper, we investigate the implementation of transfer learning on these three pre-trained models to find the best performance for the classification of truck images generated from data collected by a 3D LiDAR sensor.


In the transfer learning method, a pre-trained model is chosen as a feature descriptor or extractor. Over the years, researchers have developed various deep learning architectures to solve challenging object recognition problems. Of these, VGGNet, AlexNet, and ResNet are well known and popular, as each achieved high accuracy in classifying objects in large image databases. Therefore, to find out which pre-trained model is more suitable for transferring the learning to truck trailer classification, these three deep CNN models have been investigated in this paper. Features extracted from the pre-trained models become the selected features and can be used as input to any classification algorithm. In our previous study (Vatani Nezafat et al. 2018), we demonstrated how the features from a pre-trained ResNet model could be utilized for truck classification based on camera images. In that study, the features from ResNet were used as input to a multi-layer perceptron (MLP) neural network, which was shown to outperform other machine learning algorithms such as SVM or K-Nearest Neighbors. In this paper, we extend the transfer learning idea for solving the truck classification problem by considering additional well-known deep NNs and applying them to a new type of dataset, i.e., LiDAR point cloud data expressed as an image.

All three CNN models listed above are already trained on the ILSVRC dataset. Since the models are pre-trained, extracting the already learned features and using them directly saves a great amount of computation time. However, features in a CNN grow in complexity as we step deeper into the network. Therefore, a key task is determining the optimal point at which the pre-trained model structure should be cut to get the right level of feature complexity for our task. Four different positions for feature extraction have been investigated on all three models, as shown in Figs. 1 and 3. The features extracted at each position are used as the feature descriptors for a respective MLP classifier with two fully connected layers. Hyperparameter optimization of this classifier was done through a simple grid search to find the optimum number of hidden units in each fully connected layer. The implemented MLP classifier has 128 hidden units in the first layer and 64 hidden units in the second one, the learning rate is 0.001, and training ran for 200 epochs. Some specific details of the three deep networks are provided below.
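As a rough illustration of the classifier head described above, the following sketch builds a two-hidden-layer MLP (128 and 64 units) and runs one forward pass on a feature vector. It uses only the Python standard library rather than the TensorFlow setup used in the paper. The feature dimension (`FEATURE_DIM`), the ReLU hidden activations, the softmax output, and the random initialization are assumptions made for illustration, and no training loop is shown.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases) for one dense layer; the init scheme is illustrative only."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine map W x + b for a single input vector."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def build_mlp(n_features, hidden=(128, 64), n_classes=4, seed=0):
    """Two hidden layers (128 and 64 units, as in the text) plus an output layer."""
    rng = random.Random(seed)
    sizes = [n_features, *hidden, n_classes]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def predict_proba(mlp, features):
    """Forward pass: ReLU on hidden layers, softmax on the output (assumed activations)."""
    h = features
    for layer in mlp[:-1]:
        h = relu(dense(h, layer))
    return softmax(dense(h, mlp[-1]))


def count_parameters(mlp):
    return sum(len(w) * len(w[0]) + len(b) for w, b in mlp)


FEATURE_DIM = 256  # hypothetical length of a flattened CNN feature vector
mlp = build_mlp(FEATURE_DIM)
probs = predict_proba(mlp, [0.5] * FEATURE_DIM)  # one probability per body type
```

Because the pre-trained CNN stays frozen, only these few thousand MLP weights would be trained, which is why a relatively small labeled dataset can suffice.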
Fig. 1

The architecture of AlexNet (left) and VGGNet (right)

In 2012, Krizhevsky et al. developed AlexNet, which outperformed all other models at the ILSVRC 2012 competition. AlexNet is one of the most popular CNNs and is usually considered a baseline model. It has five convolutional layers followed by three fully connected layers. The size of the input image for this model is \(227 \times 227 \times 3\), and the numbers of units in the successive layers are 96, 256, 384, 384, 256, 4096, 4096, and 1000. The first, second, and fifth convolutional layers are followed by max-pooling layers. The first and second max-pooling layers are followed by local contrast normalization. The nonlinear activation function of each unit is the rectified linear unit (ReLU). The architecture of the model, with the details and the proposed positions for feature extraction, is shown in Fig. 1.

VGGNet performed strongly in the ILSVRC 2014 competition, taking second place. It consists of 16 convolutional layers and has a uniform structure. Its characteristics are similar to those of AlexNet, but it has more filters. One drawback of this model is its size: it has around 138 million parameters, so training it on the ImageNet dataset takes a great deal of computation power. The high number of filters makes it one of the most preferred choices as a feature extractor. The architecture of the model, with the details and the proposed positions for feature extraction, is shown in Fig. 1.

The ResNet model introduces residual learning blocks to solve the degradation problem caused by stacking many nonlinear layers. With residual learning, if the optimal solution for a specific case is close to an identity mapping (i.e., the output is a slightly altered version of the input), the solver can reach it by simply driving the weights of the nonlinear layers toward zero. This way, the solver should converge more easily by retaining the input rather than learning the whole mapping from scratch. The mathematical formulation of the added residual learning units can be expressed as:
$$y = F\left( {x, \left\{ {W_{i} } \right\}} \right) + x,$$
where \(x\) and \(y\) are, respectively, the input and output vectors of the layers considered, and the function \(F\left( {x, \left\{ {W_{i} } \right\}} \right)\) represents the residual mapping to be learned. The architecture of this building block is shown in Fig. 2. The added shortcut solves the degradation problem without introducing extra parameters or computational complexity. The ResNet architecture used in this article has 151 convolutional layers and a final dense layer with a Softmax activation function. The structure of the model, along with its respective hidden units, is presented in Fig. 3. As can be seen, a residual learning block is defined for every few stacked layers (yellow boxes). Building blocks are shown in white boxes, with the number of stacked residual blocks written on the right (e.g., \(\times 3\)). Down-sampling is done by blocks conv3_1, conv4_1, and conv5_1 with a stride of 2.
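The identity-mapping property of the residual formulation can be checked with a small sketch. Here \(F\) is assumed to be two bias-free linear maps with a ReLU in between; batch normalization, biases, and the activation that usually follows the addition are left out for brevity. The function names are illustrative and not taken from any library.

```python
def linear(x, weights):
    """Matrix-vector product for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def residual_block(x, w1, w2):
    """Compute y = F(x, {W1, W2}) + x with F(x) = W2 * relu(W1 * x)."""
    fx = linear(relu(linear(x, w1)), w2)
    return [f + v for f, v in zip(fx, x)]


x = [1.0, -2.0, 3.0]
zero_weights = [[0.0] * 3 for _ in range(3)]

# With all residual weights at zero, F(x) vanishes and the block returns its input.
y = residual_block(x, zero_weights, zero_weights)
```

Driving the weights toward zero therefore recovers the identity mapping, which is the behavior the shortcut connection is meant to make easy to learn.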
Fig. 2

Identity block

Fig. 3

The architecture of ResNet

In summary, Table 1 presents some of the characteristics of these three models.
Table 1

Some of the characteristics of the pre-trained models (columns: number of layers, number of convolutions, number of parameters (M))

Dataset and Computational Configuration

The data for this paper were collected in the westbound direction of I-64 near the Hampton Roads Bridge-Tunnel (HRBT) in southeast Virginia, where the highway has two lanes. Three surveillance cameras and one LiDAR were mounted on a roadside pole as shown in Fig. 4. LiDAR is a remote sensing technology that emits laser beams and measures the distance to any obstacle that reflects them; it also captures the intensity of the reflection. In this article, a 3D Velodyne VLP-16 LiDAR is used. It has 16 laser beams on a unit that rotates through 360°. The beams have different pitch angles, in 2° increments. In the horizontal configuration (Fig. 5a), the sensor covers a maximum of 50 m in the longitudinal direction of the roadway, but it does not produce a dense set of points for each individual vehicle observed. In the vertical configuration (Fig. 5b), it covers less than 3.5 m of the roadway in the longitudinal direction and provides denser points per vehicle. Most passenger vehicles can fit in this range, but vehicles longer than 3.5 m cannot. Therefore, multiple scans or frames need to be combined to create the full 3D or 2D profiles of trucks. For this research, the LiDAR is configured in the vertical orientation.
Fig. 4

Data collection site (left), LiDAR and camera configuration on site (right)

Fig. 5

LiDAR in (a) horizontal and (b) vertical orientation

Since trucks travel only in the right lane, LiDAR points reflected from objects elsewhere can be excluded from the dataset. Thresholds are established to eliminate these irrelevant data points. Figure 6a shows a complete LiDAR scan, whereas Fig. 6b shows the data points that remain after filtering. All analyses are performed on the subset of points belonging to vehicles traveling in the right lane, as in Fig. 6b.
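The threshold filtering described above might look like the following minimal sketch. The coordinate convention (x longitudinal, y lateral distance from the sensor, z height) and the threshold values are hypothetical, since the paper does not report them.

```python
def filter_right_lane(points, lateral_range, height_range):
    """Keep (x, y, z) points whose lateral offset y and height z lie inside the thresholds."""
    y_min, y_max = lateral_range
    z_min, z_max = height_range
    return [p for p in points if y_min <= p[1] <= y_max and z_min <= p[2] <= z_max]


# Hypothetical returns from one frame, in meters.
frame = [
    (0.1, 3.0, 1.5),   # side of a truck in the right lane
    (0.2, 7.5, 1.0),   # vehicle in the far lane
    (0.0, 4.0, 6.0),   # overhead structure
    (-0.3, 5.0, 0.5),  # lower part of the truck
]

# Assumed thresholds: the right lane spans 2-6 m laterally, vehicles are 0.2-4.5 m tall.
kept = filter_right_lane(frame, lateral_range=(2.0, 6.0), height_range=(0.2, 4.5))
```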
Fig. 6

Raw single LiDAR frame (a) and filtered single LiDAR frame (b)

The LiDAR collects data at a frequency of 10 Hz, which means that every 0.1 s it takes a snapshot of the surrounding environment, called a frame. In each frame, the point cloud data capture a portion of the vehicle passing through the LiDAR detection zone. The LiDAR frames belonging to each truck need to be identified or labeled. To do so, gaps between successive vehicles are detected simply by observing the LiDAR beams returning from the roadway surface. A vehicle counter is incremented each time a new vehicle enters the detection zone by breaking the LiDAR beams. All vehicles, including passenger cars, are then labeled with unique integer numbers. Vehicle lengths are then estimated using the speed estimation process explained below. To filter out single-unit trucks and passenger cars, vehicles that are not longer than 50 ft are removed from the dataset. LiDAR data for the remaining vehicles are processed, and the types of trailers are manually labeled based on the video from the surveillance cameras.
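The vehicle counter and the length filter described above can be sketched as follows. The per-frame occupancy flags are assumed to be available already (True when the beams are broken by a vehicle, False when they return from the roadway surface), and the example lengths are hypothetical.

```python
def label_frames(occupied):
    """Assign a vehicle id to each occupied frame; None marks frames showing only the road."""
    labels = []
    vehicle_id = 0
    previously_occupied = False
    for is_occupied in occupied:
        if is_occupied and not previously_occupied:
            vehicle_id += 1  # a new vehicle has entered the detection zone
        labels.append(vehicle_id if is_occupied else None)
        previously_occupied = is_occupied
    return labels


def keep_long_vehicles(lengths_ft, min_length_ft=50.0):
    """Drop vehicles that are not longer than the 50 ft threshold stated in the text."""
    return {vid: length for vid, length in lengths_ft.items() if length > min_length_ft}


# Hypothetical 10 Hz occupancy sequence: a short vehicle followed by a longer one.
occupied = [False, True, True, False, False, True, True, True, False]
labels = label_frames(occupied)

# Hypothetical estimated lengths (ft) for the two vehicles found above.
trucks = keep_long_vehicles({1: 15.0, 2: 68.0})
```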

To generate the full truck profile, multiple frames need to be merged. This can be done if the speed at which the truck is traveling is known. Using the first two consecutive frames and the time instances when the truck enters the scan zone of each beam, the speed of the vehicle can be estimated, since the distance between individual beams is known. Likewise, as the truck departs the detection zone, the last two frames can be used in the same manner to estimate another speed. In fact, as long as the vehicle does not occupy the entire set of 16 beams in two consecutive frames, data from such frames can be used similarly to estimate speed. These speeds are then averaged to obtain a constant average speed for the vehicle. It should be noted that the precision of this method is limited, since the distance can only be measured in increments of the distance between two consecutive beams. For this installation, the increment is about one foot. Given that the time between two frames is 0.1 s, this discretized measurement of travel distance translates to a maximum (worst-case) error of approximately ± 7 mph for a truck traveling at around 50 mph. However, since multiple estimates are used, the actual error is expected to be lower than this. The research team did not have a speed-measuring sensor at the site to quantify the error in the estimated speeds. Based on the estimated average speed, the frames belonging to the same truck are then merged by shifting consecutive frames accordingly. A typical FHWA Class 9 truck spends about 1–2 s within the LiDAR detection zone at free-flow speed. Within this time, the LiDAR generates around 30,000 points. Based on the estimated speed, we were able to reconstruct the profile of vehicles using data from all frames belonging to the same truck. Examples of 3D and 2D profiles of trucks are shown in Fig. 7. The 2D profiles are used in this paper as the input images to the CNNs.
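The arithmetic behind the quoted speed resolution, and the frame shifting that follows from it, can be sketched as below. The one-foot beam spacing and the 0.1 s frame interval come from the text. The individual beam counts, the function names, and the convention that later frames are shifted by a growing positive offset are assumptions for illustration.

```python
FT_PER_S_TO_MPH = 3600.0 / 5280.0  # 1 ft/s expressed in mph


def speed_mph(beams_advanced, beam_spacing_ft=1.0, frame_interval_s=0.1):
    """Speed implied by the vehicle edge advancing a whole number of beams between frames."""
    return beams_advanced * beam_spacing_ft / frame_interval_s * FT_PER_S_TO_MPH


def average_speed_mph(estimates):
    return sum(estimates) / len(estimates)


def frame_offsets_ft(avg_speed_mph, n_frames, frame_interval_s=0.1):
    """Longitudinal shift per frame so that all frames share one vehicle-fixed axis."""
    ft_per_s = avg_speed_mph / FT_PER_S_TO_MPH
    return [i * ft_per_s * frame_interval_s for i in range(n_frames)]


# One beam of travel per frame corresponds to roughly 6.8 mph, hence the
# "approximately 7 mph" worst-case error quoted in the text.
resolution_mph = speed_mph(1)

# Hypothetical estimates from entry and exit frames of a truck near 50 mph.
estimates = [speed_mph(7), speed_mph(8), speed_mph(7)]
avg = average_speed_mph(estimates)

# Each later frame is shifted by the distance the truck travelled since the first frame.
offsets = frame_offsets_ft(avg, n_frames=4)
```

Averaging several quantized estimates is what lets the combined speed land between the discrete steps of a single two-frame measurement.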
Fig. 7

The full truck profile generated by merging multiple frames; enclosed van (top row), container (bottom row)

There are many categories of truck body configurations that one can consider (Hernandez et al. 2016). In this paper, four of the most challenging categories have been selected for classification, i.e., containers and enclosed vans, each with or without a refrigerator unit. As shown in Fig. 8, the four examined body types are very similar in structure and shape. The total number of truck profiles used in this paper is 4714, of which 1628 are enclosed vans, 1032 are refrigerated enclosed vans, 1869 are containers, and the rest are refrigerated containers. The profiles have been saved as images and resized to match the input size of the pre-trained CNNs. 80% of the data are used for training, and the rest are set aside as test data. Ground truth labeling of these images was done manually. All computations in this article were conducted with the TensorFlow platform on Windows 7 with an Intel Xeon E5-2630 2.40 GHz processor and an NVIDIA Quadro K4200 GPU with 4 GB of memory. Since the parameters of the pre-trained models are fixed and do not change during training, it is natural to perform the feature extraction only once and save the features so that training runs faster. Table 2 presents the computation time needed for feature extraction with each model at the four levels of complexity for all images in the dataset. Although ResNet and AlexNet have almost the same number of parameters (Table 1), their computation times differ significantly because of the depth of ResNet (152 layers). VGGNet is not as deep as ResNet, but its computation time is still significantly higher than that of AlexNet because it has a much wider network (i.e., more parameters).
Fig. 8

Sample truck types and their projected merged cloud point data onto a 2D surface

Table 2

Computation time for feature extraction for all images in the dataset (columns: level of complexity, computation time (min))

Results and Discussion

Four different placements of the MLP classifier, as shown in Figs. 1 and 3, have been tested to identify the optimal point at which the pre-trained structure should be cut. A K-fold cross-validation procedure with K equal to 5 has been used to measure accuracy. This means that the dataset was divided into five portions and five different models were developed; each time, one portion was used as test data and the remaining four as training data. Accuracy is the fraction of correct predictions over the total number of predictions. Measuring accuracy on held-out folds in this way helps guard against reporting an overfitted model. The average accuracy results of the fivefold cross-validation for each model on the test data are presented in Table 3.
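The fivefold procedure described above can be sketched with the standard library alone. Shuffling the sample indices before splitting, and the fixed seed, are assumptions, since the paper does not say how the folds were formed.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; every sample is in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # assumed: random assignment to folds
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def accuracy(y_true, y_pred):
    """Fraction of correct predictions over the total number of predictions."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# 4714 truck profiles, as in the dataset described in this paper.
splits = list(k_fold_indices(4714, k=5))
fold_sizes = [len(test) for _, test in splits]
```

In use, one model would be trained per `(train, test)` pair and the five test accuracies averaged.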
Table 3

Average accuracy (%) of fivefold cross-validation for proposed positions for the classifier

In all models, Classifier_1 is the worst-performing placement in the classification task. Features are primitive and basic at this level of the network, and Classifier_1 struggles to identify the correct truck types based on them. Examples of possible features at this level are color changes, the shapes of lines, edges, etc. It is hard to distinguish between a container and an enclosed van using such simplistic features. The ResNet and VGGNet models have a slightly better representation of features than AlexNet at this level (Classifier_1). Features grow in complexity as we go deeper into the network, so Classifier_2 receives more complex features from the pre-trained CNN than Classifier_1. By the same logic, Classifier_3 and Classifier_4 should be more accurate than their preceding peers. However, the performance of the last two proposed placements is lower than that of Classifier_2. This happens because the features beyond Classifier_2 become unnecessarily complicated for the classifier to distinguish between these vehicle classes. The features at these deeper layers might be appropriate for the wide range of images the networks were trained on, but not for the relatively simpler problem being solved in this paper. Even the features at Classifier_2 of the VGGNet model are still too complex to reach the best performance: the VGGNet model performs best at Classifier_2, but its accuracy there is still significantly lower than that of the AlexNet and ResNet models at the same level. This shows that the complexity of features grows at a slower pace in the ResNet model than in the VGGNet model for this problem. The overall trend is consistent across all three models. In other words, the optimal point for selecting the best-suited features for the classification of vehicle classes lies within about the first half of the given network. In this case, the first 33 layers of ResNet_152, the first 7 layers of VGGNet, and the first 3 layers of AlexNet have the best performance for the feature extraction task.

As an example, the convergence of model accuracy for Classifier_2 of ResNet is presented in Fig. 9. An epoch is one pass in which all the training samples are used once by the optimization algorithm to update the weights of the MLP. The accuracy on the test data follows approximately the same trend as the accuracy on the training data, and after around 100 epochs the model becomes steady.
Fig. 9

Convergence of the MLP model accuracy for Classifier_2 of ResNet

Since five models were developed for the fivefold cross-validation analysis, every sample goes through the testing process exactly once. The confusion matrix of each model, summarizing the testing results of the fivefold cross-validation, is shown in Table 4. The whole numbers are sample counts, and the values in parentheses are percentages of the total number of samples in that category (i.e., the row total). All three models produce comparable accuracies, and misclassification is slightly skewed towards containers. The accuracies are somewhat lower for refrigerated trailers (i.e., Ref Containers and Ref Enclosed Vans). For example, in the AlexNet model, containers are classified with 98% accuracy, whereas refrigerated containers are classified with 90% accuracy. The relatively lower accuracy in the refrigerated container category could perhaps be attributed to the lower number of samples in this category (201 samples) compared to the other three. It should be noted that the misclassifications are almost always between a trailer type and its refrigerated counterpart. If these are ignored, i.e., if a trailer type and its refrigerated counterpart are considered as one class, the AlexNet model produces about 98% accuracy in distinguishing between these two more aggregate classes. This level of accuracy is slightly higher than what the authors obtained (i.e., 96%) in a previous study (Vatani Nezafat et al. 2018), where camera images were used to classify trailers into these two groups (i.e., container versus enclosed van). However, image-based classification was not applied to further subclassify trailers based on the refrigeration units. Depending on the camera angle, detecting the refrigeration unit from the images would be challenging. In contrast, the 3D point cloud data from LiDAR allow detailed profiles of trucks to be generated so that these units and other details can be detected, as demonstrated in this paper.
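Table 4

The confusion matrix of Classifier_2 of all three models (rows: actual class; columns: predicted class; percentages are relative to the row total)

AlexNet
| Actual class | Container | Ref container | Ref enclosed van | Enclosed van |
| Container | 1832 (98.38%) | 7 (0.38%) | 3 (0.16%) | 20 (1.07%) |
| Ref container | 20 (9.95%) | 181 (90.05%) | 0 (0.00%) | 0 (0.00%) |
| Ref enclosed van | 1 (0.10%) | 0 (0.00%) | 1003 (95.70%) | 44 (4.20%) |
| Enclosed van | 14 (0.87%) | 1 (0.06%) | 24 (1.50%) | 1564 (97.57%) |

VGGNet
| Actual class | Container | Ref container | Ref enclosed van | Enclosed van |
| Container | 1802 (96.78%) | 32 (1.72%) | 5 (0.27%) | 23 (1.24%) |
| Ref container | 65 (32.34%) | 136 (67.66%) | 0 (0.00%) | 0 (0.00%) |
| Ref enclosed van | 6 (0.57%) | 0 (0.00%) | 976 (93.13%) | 66 (6.30%) |
| Enclosed van | 19 (1.19%) | 1 (0.06%) | 106 (6.61%) | 1477 (92.14%) |

ResNet
| Actual class | Container | Ref container | Ref enclosed van | Enclosed van |
| Container | 1821 (97.80%) | 5 (0.27%) | 2 (0.11%) | 34 (1.83%) |
| Ref container | 31 (15.42%) | 169 (84.08%) | 0 (0.00%) | 1 (0.50%) |
| Ref enclosed van | 2 (0.19%) | 0 (0.00%) | 1006 (95.99%) | 40 (3.82%) |
| Enclosed van | 19 (1.19%) | 1 (0.06%) | 30 (1.87%) | 1553 (96.88%) |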
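The per-class percentages and the overall accuracy can be recomputed from the raw counts in Table 4. The sketch below uses the first block of counts, assumed to be the AlexNet matrix because its diagonal matches the 98% and 90% figures quoted for AlexNet in the text; rows are taken as actual classes and columns as predicted classes.

```python
CLASSES = ["Container", "Ref container", "Ref enclosed van", "Enclosed van"]

# First block of Table 4 (assumed to be AlexNet).
ALEXNET_COUNTS = [
    [1832, 7, 3, 20],
    [20, 181, 0, 0],
    [1, 0, 1003, 44],
    [14, 1, 24, 1564],
]


def per_class_recall(matrix):
    """Diagonal count divided by the row total, i.e. the percentages on the diagonal."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) over all predictions."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


recalls = dict(zip(CLASSES, per_class_recall(ALEXNET_COUNTS)))
overall = overall_accuracy(ALEXNET_COUNTS)
```

With these counts the overall accuracy comes out at roughly 97%, in line with the accuracy reported for the best placement of the classifier.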


In this paper, a transfer learning model is developed for the classification of truck body types based on LiDAR data. Since the simple or basic features of almost any type of image dataset are the same, it was possible to transfer features learned by a pre-trained model (ResNet_152, VGGNet, AlexNet) to another classifier and build highly accurate models with relatively small datasets. Four of the most challenging categories of trucks were chosen to train the model. Four different experiments were conducted to find the optimal level of complexity for transferring learned features. It is shown that the first 33 layers of ResNet_152, the first 7 layers of VGGNet, and the first 3 layers of AlexNet have the best performance on this dataset. With any one of the three CNNs analyzed here, as long as the optimal point for transferring the learning (or features) is selected, it is possible to get around 97% classification accuracy. The computation times for feature extraction show that it is better to use a shallower CNN model with fewer parameters for this specific problem, since the accuracy of more complex models is almost the same. Results from the confusion matrices show that the models are very accurate (~ 98%) in distinguishing between enclosed vans and containers, and that trucks with refrigerator units (refrigerated containers or refrigerated enclosed vans) are more prone to be misclassified. The AlexNet model was found to be computationally more efficient to implement and yielded classification accuracies higher than 90% for each of the four truck body types.

The authors plan to extend this research and apply it to larger datasets. A similar strategy can be applied to other categories of trucks or to cars. The effects of collecting data under different weather or visibility conditions could also be investigated. Future studies will consider these more complex conditions and additional categories of truck body types.



This research was funded by the Mid-Atlantic Transportation Sustainability University Transportation Center (MATS UTC). The authors would also like to thank the Virginia Department of Transportation (VDOT) for assisting in data collection.



Copyright information

© Springer Nature Singapore Pte Ltd. 2019

Authors and Affiliations

  1. Department of Civil and Environmental Engineering, Old Dominion University, Norfolk, USA
