Turning traffic surveillance cameras into intelligent sensors for traffic density estimation

Accurate traffic density estimates play a pivotal role in Intelligent Transportation Systems (ITS). The current practice is to obtain traffic density through specialized sensors. However, those sensors are placed in limited locations due to the cost of installation and maintenance. In most metropolitan areas, traffic surveillance cameras are widespread in road networks, and they are potential data sources for estimating traffic density across the whole city. Unfortunately, such an application is challenging since surveillance cameras are affected by the 4L characteristics: Low frame rate, Low resolution, Lack of annotated data, and Located in complex road environments. To the best of our knowledge, there is a lack of holistic frameworks for estimating traffic density from traffic surveillance camera data with the 4L characteristics. Therefore, we propose a framework for estimating traffic density using uncalibrated traffic surveillance cameras. The proposed framework consists of two major components: camera calibration and vehicle detection. The camera calibration method estimates the actual length between pixels in the images and videos, and the vehicle counts are extracted by a deep-learning-based vehicle detection method. Combining the two components, traffic density can be estimated at high granularity. To validate the proposed framework, two case studies were conducted in Hong Kong and Sacramento. The results show that the Mean Absolute Error (MAE) of the estimated traffic density is 9.04 veh/km/lane in Hong Kong and 7.03 veh/km/lane in Sacramento. The research outcomes can provide accurate traffic density without installing additional sensors.


Introduction
Accurate and real-time traffic density is an essential input to Intelligent Transportation Systems (ITS) that support various traffic operation and management tasks [1,2]. Many cities have expended considerable effort in installing traffic detectors to obtain traffic density and other traffic-related information in recent years. However, many ITS applications are still data-hungry. Using Hong Kong as an example, the current detectors (e.g., loop detectors) only cover approximately 10% of the road segments, which is not sufficient to support network-wide traffic modeling and management. How to collect real-time traffic density in an accurate, efficient, and cost-effective manner presents a longstanding challenge for not only the research community but also the private sector (e.g., Google Maps) and public agencies (e.g., Transport Department).
Various sensors and devices can be employed to estimate the traffic density directly or indirectly on urban roads. A review of existing studies on traffic density estimation is shown in Table 1. Point sensors (e.g., inductive-loop detectors, pneumatic tubes, radio-frequency identification (RFID), etc.) are widely used for traffic density estimation [3], and they are robust to environmental changes (e.g., weather, light), which enables stable 24/7 estimation. Some advanced techniques, such as Vehicular Ad hoc Networks (VANET) [4] and Unmanned Aerial Vehicles (UAV) [5], can complement point sensors and contribute to traffic density estimation. However, the common challenge for these sensing technologies is that they may not be suitable for traffic density estimation in the entire urban network due to the deployment and maintenance cost of sensors.
Traffic surveillance cameras are an essential part of an urban traffic surveillance system. Cameras are often used for visual inspection of traffic conditions and detection of traffic accidents by traffic engineers sitting in the Traffic Management Centers (TMCs). Such cameras are widely distributed in most metropolises, making large-scale traffic density estimation possible.
For example, in California, approximately 1,300 cameras are set up by Caltrans to monitor the traffic conditions on highways [6]; in Seoul, the TOPIS (Seoul Transport Operation & Information Service) system operates with 834 surveillance cameras; and in Hong Kong, the Transport Department uses about 900 surveillance cameras in its eMobility System. With various camera-based traffic surveillance systems deployed globally, there is great potential to extract traffic information from camera images and videos. Combined with recent advanced technologies, several attempts have been made at vehicle information extraction (speed and count) [7,8], vehicle re-identification [9,10] and pedestrian detection [11,12]. Furthermore, there is a great need to make use of the massive traffic surveillance camera data for traffic density estimation.
To look into the density estimation problem, we note that the traffic density $k$ is computed as the number of vehicles per lane $N$ divided by the length of the road $L$ [13], as presented in Eq. (1).
$$k = \frac{N}{L} \tag{1}$$
Note that in several studies [14-18], the road length is assumed to be fixed and known. Hence, estimating the traffic density is equivalent to counting the number of vehicles on the road. However, this assumption is relaxed in this study, as the road lengths in camera images are also unknown to us in different surveillance systems. Therefore, based on Eq. (1), the traffic density estimation from surveillance cameras can be decomposed into two sub-problems:
• Camera calibration: aims to estimate the road length $L$ from camera images, in which the core problem is to measure the distance between the real-world coordinates corresponding to the image pixels.
• Vehicle detection: focuses on counting the number of vehicles $N$, and it can be formulated as an object detection problem.
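As a small numeric illustration of Eq. (1), the following sketch converts a vehicle count and a road length into a per-lane density. The function name, the lane-count argument, and the example numbers are hypothetical and chosen only for illustration; they are not taken from the case studies.

```python
def traffic_density(vehicle_count: int, road_length_km: float, num_lanes: int) -> float:
    """Return the density in veh/km/lane following Eq. (1), k = N / L,
    where N is the number of vehicles per lane and L is the road length."""
    if road_length_km <= 0 or num_lanes <= 0:
        raise ValueError("road length and lane count must be positive")
    vehicles_per_lane = vehicle_count / num_lanes  # N in Eq. (1)
    return vehicles_per_lane / road_length_km      # k = N / L


# Hypothetical example: 18 vehicles detected on a 3-lane, 150 m road segment.
density = traffic_density(vehicle_count=18, road_length_km=0.15, num_lanes=3)
print(f"{density:.1f} veh/km/lane")
```

In the proposed framework, the count would come from the vehicle detection model and the road length from the calibrated camera.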
Both problems are separately discussed in the research field of Computer Vision (CV) [19,20]. However, the challenges of traffic density estimation from surveillance cameras are unique.
The data collected from traffic surveillance cameras exhibit the 4L characteristics. Firstly, due to personal privacy concerns and network bandwidth limits, the camera images usually have Low resolution and a Low frame rate. For example, in Hong Kong, the resolution of the monitoring image is 320 × 240 pixels, and all images are updated every two minutes [15]. Secondly, it is onerous to annotate detailed information for each camera, and hence most of the collected data are Lacking in annotations. Thirdly, surveillance cameras distributed across urban areas are often Located in complex road environments, where the roads are not simply straight segments (e.g., curved roads, mountain roads and intersections). Overall, we summarize the challenges of traffic density estimation using surveillance cameras as the 4L: Low resolution, Low frame rate, Lack of annotated data and Located in complex road environments.
The 4L characteristics present great challenges to both the camera calibration and vehicle detection problems. There is a lack of holistic frameworks that comprehensively address the 4L characteristics for traffic density estimation using surveillance cameras. To further highlight the contributions of this paper, we first review the existing literature on both camera calibration and vehicle detection.
Literature review on camera calibration. Camera calibration aims to match invariant patterns (i.e., key points) to acquire a quantitative relationship between the points on images and in the real world. Under the 4L characteristics, conventional camera calibration faces multi-fold challenges: (1) The endogenous camera parameters (e.g., focal length) can be different for each camera and are generally unknown. (2) Recognizing the brands and models of vehicles from low-resolution images is challenging, making it difficult to correctly match key points based on car model information. (3) Continuously tracking a single vehicle from low-frame-rate images is nearly impossible, which makes some of the existing algorithms inapplicable. (4) The invariant patterns in images are challenging to locate. This difficulty is caused by both the locations of the surveillance cameras (usually at the top of buildings or bridges to afford a wide perspective for visually monitoring traffic conditions) and the low image resolution. Even a one-pixel annotation error (an error made when annotating the key points) results in a deviation of tens of centimeters in the real world, which can cause the camera calibration to fail (the impact of annotation errors on calibration algorithms is discussed in Appendix A). (5) Existing camera calibration algorithms assume straight road segments, but many surveillance cameras are located in more complex road environments (e.g., curved roads, mountain roads, intersections), making the existing algorithms inapplicable.
Existing camera calibration methods only solve a subset of the aforementioned challenges. In the traditional calibration paradigm, a checkerboard with a known grid length is manually placed under the cameras [19], and key points can be selected as the intersections of the grid. However, it is time- and labor-consuming to simultaneously calibrate all cameras in the entire surveillance system. A common method for traffic camera calibration without the need for special equipment is to estimate the camera parameters using the vanishing point method, which leverages the perspective effect. The key points can be selected either as road markings [21] or common patterns on vehicles on roads [22-24]. These works assume that both sides of the road are parallel straight lines or that all vehicles drive in the same direction. However, this assumption is invalid for complex road environments, such as curved roads and intersections, where vehicles drive in multiple directions. Hence, it is difficult to generalize the method to all camera scenarios in different traffic surveillance systems. An alternative is the Perspective-n-Point (PnP) method, which does not rely on vanishing points but estimates the camera orientation given n three-dimensional points and their projected two-dimensional points in the image (normally n ≥ 3). Several algorithms have been proposed to solve the PnP problem [25-29], and they have been validated as feasible and efficient methods of traffic camera calibration using monitoring videos [30]. However, the PnP method requires prior knowledge of the camera focal length, which is unknown for many surveillance cameras in real-world applications. The PnP method can be further extended to the PnPf method, which treats the focal length as an unknown variable during the calibration [31-34], but it has rarely been successfully applied to traffic surveillance cameras in practice. An important reason is that PnPf is normally sensitive to annotation errors, which can lead to a completely false solution. Because the images from traffic surveillance cameras are of low resolution, the PnPf method may not be applicable. Additionally, a recently reported method [35] calibrates the camera in complex road environments without knowing the focal length, but it requires that the key points are on a specific vehicle model (e.g., Tesla Model S), which is impractical for low-resolution and low-frame-rate cameras.
In summary, existing camera calibration methods may not be suitable under the 4L characteristics. The main reason is that the key points on a single vehicle cannot provide enough information for the calibration due to the 4L characteristics. In contrast, if multiple key points on multiple vehicles are considered in the camera calibration method, the calibration results could be made more stable and robust. However, this is still an open problem for the research community.

Literature review on vehicle detection.
For vehicle detection, current solutions leverage machine-learning-based models to detect vehicles from camera images, while many challenges still remain: (1) The machine-learning models heavily rely on annotated images for supervised training, and labeled images are generally not available for each traffic surveillance system. (2) Vehicles only occupy several pixels in images due to the low resolution, making them difficult for machine learning models to detect. (3) During nighttime, the illumination conditions may hinder the detection of vehicles, presenting a challenge to 24/7 traffic density estimation.
Fig. 1 Challenges and limitations of existing methods for traffic camera calibration and vehicle detection
Vehicle detection from surveillance cameras has been extensively studied for many years. Background subtraction was initially considered an efficient algorithm to extract vehicles from the background [14,15,18]. The underlying assumption in background subtraction is that the background of multiple images is static, and can therefore be obtained by averaging multiple images. However, this assumption may be improper when the illumination intensity of different images varies significantly, such as at night or on windy days. Recent studies have focused on detection-based algorithms since they are more resistant to background changes. General object detection frameworks can be used to detect vehicles from images [20, 36-38]; however, as they are not tailored for vehicle detection, their performance is not satisfactory. In the transportation community, [39] applied a convolutional neural network (CNN) for vehicle detection in low-resolution traffic videos; [17] combined two classical detection frameworks to improve accuracy, and [40] extended this line of work to automatically segment the region of interest (ROI) based on optical flow. Recently, [16] generated a weighted mask to compensate for size variance caused by the perspective effect. They subsequently combined a CNN with Long Short-Term Memory (LSTM) to exploit spatial and temporal information from videos [41]. Shen et al. [42] took advantage of the K-means GIoU algorithm and then added a detection branch for fast convergence and small object detection. However, the performance of existing detection models degrades drastically when annotated data are lacking. Although few-shot learning [43] may be incorporated to help transfer the model to new scenes with few or no annotated data, consistent performance under different camera conditions during daytime and nighttime cannot be guaranteed. To develop a generalized vehicle detection model for various surveillance systems, we augment the training data by incorporating traffic-related datasets in a systematic way.
Overall, the challenges to traffic density estimation under the 4L characteristics are summarized in Fig. 1.
For road length estimation, we aim to calibrate the surveillance camera with an unknown focal length using low-quality image slices obtained under complex conditions. For vehicle number estimation, we focus on developing a training strategy that is robust for low-resolution images acquired during both daytime and nighttime without annotating extra images.
This paper proposes a holistic framework that turns traffic surveillance cameras into intelligent sensors for traffic density estimation. The proposed framework mainly consists of two components: (1) camera calibration and (2) vehicle detection. For camera calibration, a novel multi-vehicle camera calibration method (denoted as MVCalib) is developed to utilize the key point information of multiple vehicles simultaneously. The actual road length can be estimated from the pixel distance in images once the camera is calibrated. For vehicle detection, we develop a linear-program-based approach to hybridize various public vehicle datasets. Two case studies with ground truth have been conducted to evaluate the performance of the proposed framework. Results show that the estimation accuracy for the road length is more than 95%. Vehicle detection reaches an accuracy of 88% during both daytime and nighttime on low-quality camera images.
To summarize, the major contributions of this paper are as follows:
• It provides a holistic framework for 24/7 traffic density estimation using traffic surveillance cameras with the 4L characteristics: Low frame rate, Low resolution, Lack of annotated data, and Located in complex road environments.
• It develops, for the first time, a robust multi-vehicle camera calibration method, MVCalib, that collectively utilizes the spatial relationships among key points from multiple vehicles. The proposed method can be used to calibrate surveillance cameras under the 4L characteristics.

Methods
In this section, we first introduce the overall framework, and the camera calibration model and vehicle detection model are then elaborated separately.

The overall framework
The framework of the traffic density estimation model is shown in Fig. 2. Camera images are first collected from public traffic surveillance camera systems, and then key points on vehicles are annotated. The camera calibration model uses the annotated data to derive a relationship between points on images and in the real world. If we can acquire the skeleton of the road, the road length can be further computed after calibration. For vehicle detection, the camera image data are fed to a deep-learning-based vehicle detection model pre-trained on a hybridized dataset, which is used to count vehicles on the road. Combining the road length and vehicle number information, we can estimate high-granularity traffic density information on the road.

Camera calibration
In this section, we present the proposed camera calibration method, MVCalib. The background of camera calibration is reviewed first, and the proposed camera calibration model is then elaborated in detail.

Overview of camera calibration problems
A simplified pinhole camera model is widely used to illustrate the relationship between three-dimensional objects in the real world and the projected two-dimensional points on the camera images. Given the location of a certain point in the real world $[X, Y, Z]^T \in \mathbb{R}^3$, the projected point on the camera image can be represented by Eqs. (2) and (3), where Eq. (3) is the vectorized version of Eq. (2). Here, $f$ denotes the focal length, $R$ the rotation matrix and $T$ the translation vector of the camera. Once the camera parameters $f$, $R$, and $T$ are calibrated, the location of projection points on an image can be deduced from the coordinates in the real-world system.
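The projection described above can be sketched in a few lines of plain Python. This is a minimal illustration of a standard pinhole model, not a reproduction of Eqs. (2) and (3): it assumes square pixels, no lens distortion, a focal length expressed in pixel units, and a principal point `(cx, cy)` that defaults to the image origin. The function name and the example values are hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def project_point(point: Vec3, f: float, R: Mat3, T: Vec3,
                  cx: float = 0.0, cy: float = 0.0) -> Tuple[float, float]:
    """Project a real-world point onto the image with a simplified pinhole model."""
    # Camera-frame coordinates: P_c = R @ P + T
    xc, yc, zc = (sum(R[r][c] * point[c] for c in range(3)) + T[r] for r in range(3))
    if zc <= 0:
        raise ValueError("point is behind the camera")
    # zc acts as the scale factor: third row of R times P plus third element of T
    return f * xc / zc + cx, f * yc / zc + cy


# Hypothetical camera: no rotation, 10 m in front of the world origin.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_point((1.0, 2.0, 0.0), f=500.0, R=identity, T=(0.0, 0.0, 10.0))
print(u, v)
```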
The key points on vehicles in two-dimensional images and the three-dimensional real world are typically common features such as headlights, taillights, license plates, etc. Existing camera calibration methods assume that the key points of a specific vehicle model (e.g., Tesla Model S, Toyota Corolla) are known. Under the 4L characteristics, camera images are too blurry for us to distinguish vehicle models. Hence, in the proposed method, a set of vehicle model candidates is built to serve as the references of three-dimensional points. The dataset of two-dimensional and three-dimensional key points for the $i$th vehicle in images and in the real world is given in Eq. (4), where $i$ represents the vehicle index in images and $j$ represents the index of vehicle models. $p^i$ represents the set of two-dimensional key points of the $i$th vehicle on camera images, and $P^i_j$ denotes the set of three-dimensional key points of the $i$th vehicle in the real world assuming the vehicle model is $j$. $n$ and $m$ represent the number of vehicles and the number of vehicle models, respectively. $M_i$ denotes the number of key points on the $i$th vehicle.
More specifically, $p^i_k$ represents the location of the $k$th key point on vehicle $i$ in the image, and $P^i_{j,k}$ represents the three-dimensional coordinates of the $k$th key point on vehicle $i$ assuming that the vehicle model is $j$.

The MVCalib method
In this section, we present the proposed multi-vehicle camera calibration method, MVCalib. The pipeline of MVCalib is shown in Fig. 3. MVCalib proceeds through three stages: candidate generation, vehicle model matching and parameter fine-tuning. In the candidate generation stage, the solution candidates for each vehicle are generated separately based on conventional camera calibration methods. In the vehicle model matching stage, a specific model is assigned to each vehicle in the camera images. In the parameter fine-tuning stage, joint information on multiple vehicles is utilized to fine-tune the camera parameters. The fine-tuned values of $f$, $R$ and $T$ are then used to estimate the road length for traffic density estimation.

Candidate generation.
In the candidate generation stage, we first apply the conventional camera calibration method to the key points on each vehicle, assuming that its vehicle model and the focal length of the camera are known. Mathematically, for the $i$th vehicle, the coordinates of $M_i$ pairs of key points in two-dimensional space $p^i$ and in three-dimensional space $P^i_j$ under the $j$th model are known. Given a default value of the focal length $f$, the rotation matrix $R$ and translation vector $T$ can be estimated through the Efficient PnP algorithm (EPnP) [27] with a random sample consensus (RANSAC) strategy [44].
The EPnP method is applied to all pairs $(i, j)$, and hence a total of $m \times n$ EPnP estimations are conducted. The estimated camera parameters (candidates) are denoted as $\hat{\psi}^i_j = (f, \hat{R}^i_j, \hat{T}^i_j)$, which represent the focal length, rotation matrix and translation vector for the $i$th vehicle under the $j$th model.

Vehicle model matching.
In the vehicle model matching stage, the most closely matched vehicle model is determined by minimizing the projection error from the real world to the image plane for each vehicle $i$. Mathematically, we aim to select the best vehicle model $j$ from the candidates $\hat{\psi}^i_j$ to obtain the camera parameter $\psi^i$ for each vehicle. In the candidate generation stage, the focal length is fixed to a default value, which may contribute to errors in the projection. Therefore, in this stage, we adjust the focal length to a more accurate value and refine the parameter estimation. To this end, we formulate an optimization problem with the objective of minimizing the projection loss from the three-dimensional real world to two-dimensional camera images, as presented in Eq. (5).
where $L_v(\cdot)$ defines the projection loss from the three-dimensional real world to two-dimensional images for the key points on vehicles. $s^i_{j,k} = R^i_{j,3} P^i_{j,k} + T^i_{j,3}$ is the scale factor for the $k$th key point on the $i$th vehicle with the $j$th model. $R^i_{j,3}$ represents the third row of the rotation matrix and $T^i_{j,3}$ denotes the third element of the translation vector. The focal length of a camera $f^i_j$ should be greater than 0.
To solve the optimization problem $L_v(\psi^i_j)$, we employ the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [45], which is an evolutionary algorithm for non-linear and non-convex optimization problems, to search for the optimal parameter $\psi^i_j$ for each combination of vehicle and vehicle model. As the performance of CMA-ES depends on the initial points, we initialize the search from the candidates $\hat{\psi}^i_j$. For vehicle $i$, we assign the vehicle model with the minimal projection loss $L_v$, as presented in Eq. (6).
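The model selection step can be illustrated with a short sketch. It assumes that the projection loss is the mean squared reprojection error under a simplified pinhole model with the principal point at the origin, which may differ from the exact form of Eq. (5), and it omits the CMA-ES refinement of each candidate. The model names, key point coordinates and helper functions are hypothetical.

```python
import math


def project(P, f, R, T):
    """Simplified pinhole projection (principal point at the origin)."""
    xc, yc, zc = (sum(R[r][c] * P[c] for c in range(3)) + T[r] for r in range(3))
    return f * xc / zc, f * yc / zc


def projection_loss(points_2d, points_3d, params):
    """Mean squared distance between annotated and reprojected key points."""
    f, R, T = params
    errors = [math.dist(p, project(P, f, R, T)) ** 2
              for p, P in zip(points_2d, points_3d)]
    return sum(errors) / len(errors)


def match_vehicle_model(points_2d, model_points_3d, candidate_params):
    """Return the model index j with the minimal projection loss, and that loss."""
    losses = {j: projection_loss(points_2d, model_points_3d[j], candidate_params[j])
              for j in model_points_3d}
    best = min(losses, key=losses.get)
    return best, losses[best]


# Hypothetical key points (in meters) for two candidate vehicle models.
models = {
    "sedan": [(0.0, 0.0, 0.0), (1.6, 0.0, 0.0), (0.8, 0.5, 1.0)],
    "hatchback": [(0.0, 0.0, 0.0), (1.4, 0.0, 0.0), (0.7, 0.6, 0.8)],
}
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
camera = (500.0, identity, (0.0, 0.0, 10.0))
candidates = {j: camera for j in models}  # one candidate parameter set per model

# Synthetic annotations generated from the "sedan" geometry.
observed = [project(P, *camera) for P in models["sedan"]]
best_model, best_loss = match_vehicle_model(observed, models, candidates)
print(best_model, best_loss)
```

In the method described above, each candidate would first be refined by CMA-ES before the losses are compared.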
Parameter fine-tuning. In this stage, we combine the key point information on multiple vehicles and further fine-tune the parameters to obtain the final estimation of the camera parameters $\psi$. In previous stages, we made use of the key point information on every single vehicle and applied the estimated camera parameter $\psi^i$ to each vehicle $i$ separately. Ideally, if $\psi^i$ is perfectly estimated, we can project the key points on all vehicles in camera images back to the real world using $\psi^i$, and those key points should exactly match the key points on the vehicle models. Based on this criterion, we can select the camera parameters from $\psi^i$ and further fine-tune them to obtain $\psi$.
To this end, we back-project the two-dimensional points in camera images to the three-dimensional real world by using the parameter $\psi^{i'}$ of vehicle $i'$ as an "anchor". Mathematically, for the $i$th vehicle, the coordinates of the $k$th key point on the camera image and in the real world can be represented as $p^i_k$ and $P^i_k$, respectively. Note that $P^i_k$ is a member of $\{P^i_{j,k} \mid 1 \le j \le m\}$, as the vehicle model is fixed in the vehicle model matching stage. To back-project $p^i_k$ to the real-world space using $\psi^{i'}$, we solve a system of equations derived from Eq. (2), as shown in Eq. (7), where $p^i_k$ is the two-dimensional coordinate of the $k$th key point on vehicle $i$ in the camera images, and $\hat{P}^i_k(\psi^{i'})$ represents the back-projected point of the $k$th key point on the $i$th vehicle given the camera parameter $\psi^{i'}$ of anchor vehicle $i'$.
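One way to picture the back-projection is the sketch below. A single pixel does not determine a three-dimensional point on its own, so this illustration assumes that the height of the key point in the world frame (`z_world`) is known, for example from the matched vehicle model; this assumption, the function name and the principal point at the origin are ours and may differ from the system actually solved in Eq. (7).

```python
def back_project(pixel, f, R, T, z_world):
    """Recover (X, Y) of a world point with known height Z from its pixel location."""
    u, v = pixel
    # Each pixel coordinate gives one linear equation in the unknowns (X, Y):
    #   (u * R[2] - f * R[0]) . P + (u * T[2] - f * T[0]) = 0
    #   (v * R[2] - f * R[1]) . P + (v * T[2] - f * T[1]) = 0
    row_u = [u * R[2][c] - f * R[0][c] for c in range(3)]
    row_v = [v * R[2][c] - f * R[1][c] for c in range(3)]
    rhs_u = -(row_u[2] * z_world + u * T[2] - f * T[0])
    rhs_v = -(row_v[2] * z_world + v * T[2] - f * T[1])
    # Solve the 2x2 system with Cramer's rule.
    det = row_u[0] * row_v[1] - row_u[1] * row_v[0]
    if abs(det) < 1e-12:
        raise ValueError("degenerate configuration: cannot back-project")
    x = (rhs_u * row_v[1] - row_u[1] * rhs_v) / det
    y = (row_u[0] * rhs_v - row_v[0] * rhs_u) / det
    return x, y, z_world


# Hypothetical camera: no rotation, 10 m in front of the world origin, f = 500 px.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point = back_project((50.0, 100.0), f=500.0, R=identity, T=(0.0, 0.0, 10.0), z_world=0.0)
print(point)
```

With these example values, the pixel (50, 100) maps back to the ground-plane point that a pinhole camera with the same parameters would project to that pixel.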
The primary loss between back-projected points and real-world points is defined in Eq. (8).
where $\xi_l(i, k_1, k_2, \psi^{i'})$ and $\xi_r(i, k_1, k_2, \psi^{i'})$ represent the distance and angle losses between the back-projected points and real-world points, and $\alpha$ is a hyper-parameter that adjusts the weight of each loss.
The two vectors compared in these losses are formed by any two real-world points and the corresponding two back-projected points on the same vehicle $i$, and $k_1$ and $k_2$ represent two distinct key point indices on that vehicle. The distance loss represents the gap between the Euclidean distance of the back-projected points and that of the real-world points, while the angle loss can be regarded as the sine of the angle between the two vectors formed by the back-projected points and the real-world points. We further aggregate the loss $L_p(i, \psi^{i'} \mid \alpha)$ over different vehicles $i$ based on their relative distance. In general, if a vehicle is further from the anchor vehicle, then the loss in the back-projected points is larger, and we have less confidence in these points. Therefore, smaller weights are assigned to vehicles that are further from the anchor vehicle. The objective of minimizing the fine-tuning loss for all vehicles is formulated to consider different weights due to the relative distance, as presented in Eq. (9).
where $\hat{C}^i$ is the centroid of all back-projected key points on the $i$th vehicle, and $\hat{C}^{i'}$ is the centroid of all back-projected key points on the anchor vehicle $i'$. $\omega(\hat{C}^i, \hat{C}^{i'} \mid \tau)$ is the weighting function for vehicle $i$ using vehicle $i'$ as an anchor. The temperature $\tau$ is a hyper-parameter that controls the distribution of the weighting function. When $\tau = 0$, the weighting function uniformly averages the loss over all vehicles; when $\tau < 0$, more attention is paid to vehicles that are close to the anchor vehicle, and vice versa.
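The losses and the distance-based weighting described above can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from Eqs. (8) and (9): the two losses are combined as distance loss plus $\alpha$ times angle loss, and the weighting function is a softmax over $\tau$ times the centroid distance, which reproduces the stated behavior (uniform at $\tau = 0$, emphasis on nearby vehicles for $\tau < 0$). All function names and numbers are hypothetical.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _norm(v):
    return math.sqrt(sum(x * x for x in v))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def pair_loss(back_k1, back_k2, real_k1, real_k2, alpha):
    """Distance loss plus alpha times angle loss for one pair of key points."""
    v_back = _sub(back_k2, back_k1)  # vector between back-projected points
    v_real = _sub(real_k2, real_k1)  # vector between real-world points
    distance_loss = abs(_norm(v_back) - _norm(v_real))
    # Sine of the angle between the two vectors via the cross product.
    angle_loss = _norm(_cross(v_back, v_real)) / (_norm(v_back) * _norm(v_real))
    return distance_loss + alpha * angle_loss


def vehicle_weights(centroids, anchor_centroid, tau):
    """Softmax over tau * distance to the anchor centroid (assumed form)."""
    scores = [tau * math.dist(c, anchor_centroid) for c in centroids]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fine_tuning_loss(per_vehicle_losses, centroids, anchor_centroid, tau):
    """Weighted sum of per-vehicle losses for one choice of anchor vehicle."""
    weights = vehicle_weights(centroids, anchor_centroid, tau)
    return sum(w * l for w, l in zip(weights, per_vehicle_losses))


# Hypothetical example: a unit real-world vector back-projected as a
# perpendicular vector of length 2, with alpha = 6.
example_pair = pair_loss((0, 0, 0), (0, 2, 0), (0, 0, 0), (1, 0, 0), alpha=6)
centroids = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
uniform = vehicle_weights(centroids, centroids[0], tau=0.0)
local = vehicle_weights(centroids, centroids[0], tau=-0.1)
print(example_pair, uniform, local)
```

The final parameters would then be chosen by evaluating such a loss for each candidate anchor vehicle and keeping the one with the smallest value.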
To obtain the final estimation of the camera parameters, we minimize the objective $L_f(\psi^{i'} \mid \alpha, \tau)$ in Eq. (9) for each selection of anchor vehicle $i'$. The optimal estimation is selected as that with the minimal loss, as shown in Eq. (10).

Vehicle detection
In this section, we present the vehicle detection model, which counts the number of vehicles on road segments from camera images. State-of-the-art vehicle detection models adopt Deep Learning (DL) based methods and are trained on vehicle-related datasets. The training process of DL models usually requires massive data. Owing to the 4L characteristics, the quantity of annotated camera images for a specific traffic surveillance system cannot support the complete training of a modern DL-based vehicle detection model. In addition, it is inefficient to train new models for each traffic surveillance system separately. Therefore, we adopt the transfer learning scheme to first train the model on traffic-related public datasets and then apply the model to specific surveillance camera systems [46]. Existing public datasets are designed for a range of purposes, such as vehicle re-identification (reID), autonomous driving, vehicle detection, etc. [16, 47-52]. The camera images in different datasets have different endogenous attributes (e.g., focal length, type of photosensitive element, resolution, etc.) and exogenous attributes (e.g., perspective, illumination, directions, etc.). Additionally, the datasets differ in size. A summary of the existing traffic-related public datasets is presented in Table 2, and snapshots of some of the datasets are shown in Fig. 4.
We categorize the camera images from these datasets into different traffic scenarios, which include the time of day (daytime and nighttime), congestion level, surrounding environment, etc. Each traffic scenario represents a unique set of features in the camera images, so if a DL model is trained for one traffic scenario, it might not perform well on a different scenario. Given the 4L characteristics, the camera images in a large-scale traffic surveillance system may cover multiple traffic scenarios, so it is important to merge and balance these datasets. The linear program (LP) hybrid dataset balances the proportion of images from each traffic scenario to prevent one traffic scenario from dominating the dataset. For example, if most camera images are captured during the daytime, then the trained vehicle detection model will not perform well on nighttime images. If different traffic scenarios are comprehensively covered, balanced, and trained, the robustness and generality of the detection model will be significantly improved.
Following the above discussion, the pipeline for the vehicle detection model is presented in Fig. 5.
One can see that the multiple traffic-related datasets are fed into the LP to generate the LP hybrid dataset, and the dataset will be used to train the vehicle detection model.The trained model can be directly applied to different traffic surveillance systems.
As stated above, the hybrid detection dataset is formulated as an LP, the goal of which is to maximize the total number of images in the dataset, written as
$$\max_{q} \sum_{\mu=1}^{u} \sum_{\nu=1}^{v} q_{\mu,\nu} \tag{11}$$
where $u$ denotes the number of datasets, and $v$ represents the number of traffic scenarios. $q_{\mu,\nu}$ are decision variables that denote the number of images to be incorporated into the LP hybrid dataset from dataset $\mu$ for traffic scenario $\nu$.
The constraints of the proposed LP are constructed based on two principles: (1) The difference between the numbers of images from different traffic scenarios should be limited within a certain range.(2) The number of images contributed by each dataset should be similar.Mathematically, the constraints are presented in Eq. (12).
where the former two constraints adjust the image contribution from different datasets, while the latter two balance the number of images from different traffic scenarios. $Q_{\mu,\nu}$ represents the total number of images for traffic scenario $\nu$ in dataset $\mu$, and $q_{\mu,\nu} \le Q_{\mu,\nu}$ enforces that the selected number of images does not exceed the total number of images. $\beta$ is the maximum tolerance parameter for the upper and lower bounds of the number of images in different traffic datasets given certain scenarios, and $\gamma$ is another maximum tolerance parameter limiting the difference between the numbers of images selected from different scenarios.
With the objective in Eq. (11) and the constraints in Eq. (12), we can formulate the LP hybrid dataset that maximizes the number of images and balances the contributions from different datasets as well as traffic scenarios.
Fig. 5 The pipeline of vehicle detection
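To make the structure of such a program concrete, one possible set of linear constraints that matches the verbal description is written out below. This is an illustrative assumption and not a reproduction of Eq. (12): the first two constraints keep each dataset's contribution to a scenario within a tolerance $\beta$ of the average contribution, the next two keep each scenario's total within a tolerance $\gamma$ of the average scenario total, and the last one bounds the selection by the available images.

```latex
\begin{aligned}
\max_{q}\quad & \sum_{\mu=1}^{u}\sum_{\nu=1}^{v} q_{\mu,\nu} \\
\text{s.t.}\quad
& q_{\mu,\nu} \ge (1-\beta)\,\frac{1}{u}\sum_{\mu'=1}^{u} q_{\mu',\nu}, && \forall \mu, \nu \\
& q_{\mu,\nu} \le (1+\beta)\,\frac{1}{u}\sum_{\mu'=1}^{u} q_{\mu',\nu}, && \forall \mu, \nu \\
& \sum_{\mu=1}^{u} q_{\mu,\nu} \ge (1-\gamma)\,\frac{1}{v}\sum_{\nu'=1}^{v}\sum_{\mu=1}^{u} q_{\mu,\nu'}, && \forall \nu \\
& \sum_{\mu=1}^{u} q_{\mu,\nu} \le (1+\gamma)\,\frac{1}{v}\sum_{\nu'=1}^{v}\sum_{\mu=1}^{u} q_{\mu,\nu'}, && \forall \nu \\
& 0 \le q_{\mu,\nu} \le Q_{\mu,\nu}, && \forall \mu, \nu
\end{aligned}
```

All constraints are linear in $q_{\mu,\nu}$, so a program of this shape can be handed to any standard LP solver. In this assumed form, a dataset with no images for a given scenario would force the lower bound to zero for that scenario, so such dataset-scenario pairs would need to be excluded from the balancing constraints in practice.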
The vehicle detection model is built on top of You Only Look Once (YOLO)-v5, a widely used object detection model [53].YOLO-v5 is initially pre-trained on the full COCO dataset, and we adopt the transfer learning scheme

Numerical experiments
In this section, we conduct numerical experiments to evaluate the performance of the proposed camera calibration and vehicle detection methods on two traffic surveillance camera systems.

Experimental settings
To demonstrate that the proposed framework can be applied to traffic density estimation in countries with different traffic surveillance systems, two case studies of traffic density estimation are conducted in Hong Kong (HK) and Sacramento, California (Sac), where ground truth data can be obtained at both sites. A comparison of the two cameras is shown in Table 3.
• HK: Camera images in Hong Kong are obtained from HKeMobility³ at Chatham Road South, Kowloon, Hong Kong SAR, with the camera code K109F. Images containing seven vehicles are selected from June 22nd to June 25th, 2020.
• Sac: Camera images in Sacramento are obtained from the Caltrans system⁴ at Capital City Freeway at E Street, Sacramento, CA, USA. Images containing seven vehicles are selected from February 17th to December 18th, 2022.
For camera calibration, we select all the vehicles that are not shadowed by other vehicles, and those vehicles are annotated with eight key points: left headlight, right headlight, front license plate center, front wiper center, left wing mirror, right wing mirror, back left corner and back right corner. Any key points not visible in an image are excluded. In addition, three-dimensional information is included for five popular vehicle models: Toyota Corolla, Toyota Prius, Honda Civic, BMW Series 4 and Tesla Model S. The three-dimensional key points for those models are measured from the Dimensions website (footnote 5). α in Eq. (9) is set to 6.
For vehicle detection, all of the datasets summarized in Table 2 are incorporated. The ratio factors γ and β in Eq. (12) are set to 0.25. The LP hybrid dataset is divided into a training set (80%) and a validation set (20%). A total of 3,812 camera images are annotated to test the performance of the model trained on the LP hybrid dataset.
All experiments are conducted on a desktop with an Intel Core i9-10900K CPU @ 3.7 GHz × 10, 2666 MHz × 2 × 16 GB RAM, GeForce RTX 2080 Ti × 2, and a 500 GB SSD. The camera calibration and vehicle detection models are both implemented in Python. For the camera calibration model, OpenCV [54] is used for computing Eq. (2) and running the EPnP algorithm [27]. In the candidate generation stage, the focal length is fixed at 350 millimeters. The CMA-ES algorithm [45] is executed with the Nevergrad package [55]. The numbers of iterations of CMA-ES in the vehicle model matching and parameter fine-tuning stages are set to 4,000 and 20,000, respectively. When tuning the vehicle detection model, we set the number of training epochs to 300, and the other hyperparameters take the default settings (footnote 6) of the original YOLO-v5. The Adam optimizer [56] is adopted with a learning rate of 0.001.

3 https://www.hkemobility.gov.hk/tc/traffic-information/live/cctv
4 https://cwwp2.dot.ca.gov/vm/iframemap.htm
5 https://www.dimensions.com
6 https://github.com/ultralytics/yolov5

Fig. 6 Fine-tuning losses with parameters in the three stages for camera calibration in HK (the vehicle index is defined in Eq. (4))

In the testing stage, the inference speed is more than 100 fps (frames per second). Suppose the camera images are updated every 2 min, i.e., the frame rate of each camera is 1/120 fps; then a single server can simultaneously process the images from over 12,000 cameras. As surveillance cameras only need to be calibrated once with few input parameters, city-wide real-time traffic density estimation can be achieved using the proposed framework.
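The single-server capacity quoted above follows from simple arithmetic: each camera delivers one image per refresh interval, so the server can serve as many cameras as frames it processes within that interval. A minimal sketch, where the function name `max_cameras` is introduced here purely for illustration:

```python
def max_cameras(inference_fps: float, refresh_interval_s: float) -> int:
    """Number of cameras one server can keep up with.

    Each camera produces one image per refresh interval, so the server can
    handle as many cameras as frames it processes within that interval.
    """
    return int(inference_fps * refresh_interval_s)


# 100 fps inference and one new image every 2 min (120 s) per camera.
print(max_cameras(100, 120))  # 12000
```

The figure ignores image download and pre-processing time, so it is best read as an upper bound on what the detection step alone allows.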

Experimental results
In this section, we compare the proposed camera calibration and vehicle detection models with existing baselines, respectively.

Camera calibration
To evaluate the performance of the camera calibration method, we first compare the fine-tuning loss defined in Eq. (9) among baseline models for the two cameras in HK and Sac. Based on the calibration results, we estimate the road length from the camera images, and the length estimated by each model is compared with the actual length.
To demonstrate the necessity of the three steps in MVCalib, Fig. 6 plots the fine-tuning loss defined in Eq. (9) for the three stages: candidate generation, vehicle model matching and parameter fine-tuning. In particular, Fig. 6a includes the losses of all the vehicle index and vehicle model pairs for the first two stages, and Fig. 6b plots the loss based on the matched vehicle model with the minimal fine-tuning loss. One can see that the fine-tuning loss defined in Eq. (8) decreases after each stage, which indicates that the CMA-ES can successfully reduce the loss in each stage.
We then measure the lengths of road markings on the camera images, as the road markings are invariant features on the road, and their lengths can be determined from measurements or official guidebooks. Detailed road marking information for the HK and Sac studies is shown in Fig. 7.
In Fig. 7a, the length of the white line is 1 m and the interval between the white lines is 5 m, which are obtained from field measurements. On the camera images, a total of 14 points are annotated at the midpoints of white lines, resulting in 12 line segments of the same length (shown in Fig. 7b). Hence each line segment corresponds to 6 m in the real world. For the camera images in Sac, we likewise use the actual lengths of the lane markings on the Capital City Freeway as the ground truth. According to the Manual on Uniform Traffic Control Devices (MUTCD) [57], the length of a white line is 10 feet (approximately 3.05 m) and the interval is 30 feet (approximately 9.14 m) (shown in Fig. 7c). On the camera images, we annotate 14 points resulting in 12 line segments (shown in Fig. 7d), each spanning 40 feet (approximately 12.19 m).
We compare our method with existing baseline models including EPnP [27], UPnP, UPnP+GN (UPnP fine-tuned with the Gauss-Newton method) [31], GPnP and GPnP+GN (GPnP fine-tuned with the Gauss-Newton method) [32]. The calibration results are shown in Table 4. The estimated lengths of the road markings on camera images are compared with the actual lengths, and three metrics are employed for benchmark comparison: Root Mean Square Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). The calculation of RMSE, MAE, and MAPE is shown in Eq. (13):

$$
\mathrm{RMSE} = \sqrt{\frac{1}{N_{rm}} \sum_{i=1}^{N_{rm}} \left(\hat{l}_i^{rm} - l_i^{rm}\right)^2}, \quad
\mathrm{MAE} = \frac{1}{N_{rm}} \sum_{i=1}^{N_{rm}} \left|\hat{l}_i^{rm} - l_i^{rm}\right|, \quad
\mathrm{MAPE} = \frac{100\%}{N_{rm}} \sum_{i=1}^{N_{rm}} \frac{\left|\hat{l}_i^{rm} - l_i^{rm}\right|}{l_i^{rm}} \tag{13}
$$

where $N_{rm}$ represents the number of road markings, and $\hat{l}_i^{rm}$ and $l_i^{rm}$ are the estimated and actual lengths of the $i$th road marking, respectively. At each stage of MVCalib, we compare its result with baseline methods in terms of their ability to solve the PnPf problem. To conduct an ablation study gauging the contribution of each stage, we run MVCalib with only the first stage (candidate generation), with the first two stages (up to vehicle model matching), and with all three stages. The three models are referred to as MVCalib CG, MVCalib VM, and MVCalib, respectively. In fact, MVCalib CG is equivalent to the EPnP method.
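The three error metrics can be computed directly from paired lists of estimated and actual road-marking lengths. The sketch below uses the standard definitions of RMSE, MAE and MAPE; the function name `length_errors` and the sample values are illustrative assumptions, not data from the experiments.

```python
import math
from typing import Dict, Sequence


def length_errors(estimated: Sequence[float], actual: Sequence[float]) -> Dict[str, float]:
    """Return RMSE and MAE (same unit as the inputs) and MAPE (in percent)."""
    if len(estimated) != len(actual) or not estimated:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(estimated)
    diffs = [e - a for e, a in zip(estimated, actual)]
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mae = sum(abs(d) for d in diffs) / n
    # MAPE divides each absolute error by the actual length (must be non-zero).
    mape = 100.0 * sum(abs(d) / a for d, a in zip(diffs, actual)) / n
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape}


# Hypothetical estimates for four 6 m road-marking segments (illustrative only).
print(length_errors([5.7, 6.3, 6.0, 5.4], [6.0, 6.0, 6.0, 6.0]))
```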
One can see from Table 4 that UPnP (GN) and GPnP (GN) yield unsatisfactory solutions owing to the low image quality. As they take the focal length into account, the complexity of the problem is significantly increased, and hence they require high-resolution images and more numerous and accurate annotation points.
As for the ablation study, we compare MVCalib CG, MVCalib VM, and MVCalib to evaluate the contribution of each stage. In the vehicle model matching stage, if we optimize the focal length with other parameters simultaneously, the estimation results are greatly improved relative to MVCalib CG, demonstrating that the estimation of focal length is necessary and important for the calibration of traffic surveillance cameras. In the full MVCalib, we also incorporate the joint information of multiple vehicles under the same camera. MVCalib achieves the best result among all models. For the surveillance camera in HK, the average error is only approximately 40 cm for estimating the six-meter road markings, less than 10% in MAPE, while in Sac, the average error is about 1 m for the forty-foot road markings, also less than 10% in MAPE.
Besides, MVCalib outperforms the other models in terms of all three metrics, which means that the calibration results are close to the ground truth. Snapshots of calibration results of surveillance cameras in HK and Sac are shown in Fig. 8, where the distance between any two red dots is one meter.
Owing to the perspective effect, the distance between red dots on images appears closer when they are more distant from the camera. Through visual inspection, we note that the estimation of focal length is reasonable and the skew of perspective error is small. Additional experiments regarding the convergence and sensitivity of MVCalib are presented in Appendix B, and the choice of τ is discussed in Appendix D.

Vehicle detection
In the detection model, two traffic scenarios are considered: daytime and nighttime. A total of 76,898 images are included in the LP hybrid detection dataset after solving the LP in Eqs. (11) and (12). The detailed allocation of the 76,898 images is presented in Table 5.
To evaluate the generality of the vehicle detection model trained on the LP hybrid dataset, we also train YOLO-v5 individually on the BDD100K, BITVehicle, CityCam, COCO, MIO-TCD-L, and UA-DETRAC datasets for benchmark comparison. Additionally, an integrated dataset incorporating all of the aforementioned datasets without balancing the numbers of daytime and nighttime images, called the Spaghetti dataset, is also compared. Moreover, we down-sample five datasets, named Random datasets, whose sizes are the same as that of the LP hybrid dataset, to ablate the influence of the number of images. The mean performance over the Random datasets is reported in the final results. For the model trained on each dataset, we report the vehicle detection accuracy on the testing data. Several metrics are used in evaluating the performance of the vehicle detection models, including precision, recall, AP@0.5, and AP@0.5:0.95. These metrics are explained in Appendix C. In this paper, the threshold for IoU (Intersection over Union) is 0.45 and the threshold for object confidence is 0.25, which are the default settings in YOLO-v5.
Tables 6 and 7 present the evaluation results for the models trained with the LP hybrid and other datasets for daytime and nighttime, respectively. The model trained on the LP hybrid dataset reaches the highest recall rate and also achieves a desirable precision rate. The high recall rate means the model trained on the LP hybrid dataset is able to find as many vehicles as possible in surveillance images. In daytime detection, the gaps in recall between the model trained on the LP hybrid dataset and the ones trained on the Spaghetti and Random datasets are not conspicuous, since most images in the training set are shot during the daytime. In nighttime detection, the recall rate is 2-3% higher than that of the models trained on the Spaghetti and Random datasets, which demonstrates that the proposed hybrid strategy is adept at catching vehicles at night while maintaining a decent detection performance during the daytime. For the metrics of mAP@0.5 and mAP@0.5:0.95, the model trained on the Spaghetti dataset achieves the best performance, but the gap between the models trained on the Spaghetti dataset, the Random dataset and the LP hybrid dataset is less than 1% for mAP@0.5 and less than 2% for mAP@0.5:0.95. For images at nighttime, the model trained on the LP hybrid dataset outperforms those trained on the Spaghetti dataset and the Random dataset on mAP@0.5, also indicating that the proposed LP hybrid dataset can improve the detection performance at night. Moreover, the Spaghetti dataset contains more than 430,000 images and takes more than 21 days to train on. The LP hybrid dataset is a strategic sample from the Spaghetti dataset whose size is about 76,000, roughly one-sixth of the Spaghetti dataset. It only takes 6 days to train the model on the LP hybrid dataset, yet it reaches a performance competitive with the model trained on the Spaghetti dataset. The Random dataset is a random sample from the Spaghetti dataset with the same size as the LP hybrid dataset. While it maintains a decent performance in daytime detection, its nighttime detection is weaker since it contains fewer images at night. The sizes of the remaining datasets vary considerably. However, based on the performance, the models trained on these datasets may not transfer well to new scenes.
In a nutshell, compared with a dataset of similar size, the LP hybrid dataset can balance the accuracy for different scenarios (e.g., daytime, nighttime, etc.), which has important real-world implications. Compared with other datasets of larger sizes, the LP hybrid dataset can achieve similar performance in different scenarios, but it takes less time to train the detection model. The LP hybrid approach is a simple and easy-to-implement way to boost detection accuracy for rare traffic scenarios in the face of limited computational resources.

Case study I: surveillance cameras in Hong Kong
In this section, we conduct a case study of traffic density estimation using camera images on Chatham Road South, Hong Kong SAR. Given the study region, we divide the road into four lanes (numbered along the x-axis), and define vehicle locations along the y-axis, as shown in Fig. 9.
The length of each lane can be estimated from the images using the calibration results, and the number of vehicles can be counted using the vehicle detection model. The traffic density in each lane can be estimated by dividing the number of vehicles by the length of each lane at each location and time point. To evaluate the estimated density, a high-resolution (1920 × 1080 pixels per frame) camera is installed to film the same region from a different direction, and its video is used as the ground truth in this case study. The video recorded by this camera shows the traffic conditions over 21 h from 11:30 PM, September 23rd to 8:30 PM, September 24th, 2020.
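The density computation described above reduces to one division per lane. The following sketch assumes the lane lengths come from the calibration step and the vehicle counts from the detector; the function name `lane_density` and all numbers are hypothetical and only illustrate the unit conversion to veh/km/lane.

```python
def lane_density(vehicle_count: int, lane_length_m: float) -> float:
    """Traffic density of a single lane in veh/km/lane."""
    if lane_length_m <= 0:
        raise ValueError("lane length must be positive")
    # Vehicles per meter scaled to vehicles per kilometer.
    return vehicle_count * 1000.0 / lane_length_m


# Hypothetical snapshot: detected vehicles per lane and calibrated lane lengths.
counts = {1: 4, 2: 3, 3: 1, 4: 0}
lengths_m = {1: 80.0, 2: 80.0, 3: 75.0, 4: 75.0}
densities = {lane: lane_density(counts[lane], lengths_m[lane]) for lane in counts}
print(densities)
```

Because a single image is only a snapshot, such per-image values would typically be averaged over a time interval before being compared with detector data.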
An overview of the vehicle detection results is presented in Fig. 10. Figure 10A displays a snapshot of vehicle detection using the model trained on the LP hybrid dataset of images taken in the daytime. By boxing out identical study regions in the traffic surveillance camera images and high-resolution videos (shown in Fig. 10B, C), the estimated number of vehicles can be compared with the ground truth in Fig. 10D. We select four points or regions in Fig. 10D, which are shown in Fig. 10A, E, F and G. Figure 10A shows the beginning of the morning peak when the vehicle number significantly increases. Lanes #1 and #2 in the study region (numbered from the left) become visibly congested in the camera images.
Points E and F are a pair of points that depict contrasting traffic conditions when the traffic density fluctuates dramatically within a short time interval. They correspond to images taken at around 11:00 AM and 11:30 AM, respectively, on September 24, 2020. In Fig. 10E, there are few vehicles on the road, and hence the traffic density is relatively low at point E. However, at point F, there is a sharp increase in the demand on the road. The traffic condition oscillates owing to the traffic signals downstream, which causes the pronounced changes between points E and F. Figure 10G depicts the traffic conditions at the evening peak, when the vehicle number reaches the daily maximum. The evening peak fades away quickly and disappears at approximately 8:00 PM. Comparing the estimated and ground-truth traffic density, the developed model succeeds in tracking the growth of the morning peak and detecting the fluctuation of traffic conditions. However, at point G, some of the vehicles are missed in the evening peak. This may have been caused by glare from the headlights and the light reflected from the ground, which make it difficult for the vehicle detection model to identify the features of vehicles. This phenomenon is a common issue in Computer Vision (CV) and is left for future research. Overall, the estimated result is close to the ground truth most of the time, which demonstrates that the detection model can accomplish accurate detection despite the low resolution and low frame rate of the images.

Fig. 11 The variation of traffic density from 00:00 AM to 9:00 PM in HK

Fig. 12 The location of the matched surveillance camera and double-loop detector in Caltrans
The RMSE, MAE, and MAPE of the estimated traffic density for each lane and the entire road are presented in Table 8, and a comparison of traffic densities from estimation and the ground truth is shown in Fig. 11.
One can see that the estimated density approximates the actual density, and the density fluctuation is accurately captured. The MAPE is relatively high because this metric is sensitive when the density is small. For example, if the true density is 2 veh/km/lane while the estimated density is 1 veh/km/lane, then the MAPE is 50%. The traffic density of Lane #1 is overestimated with an MAE of approximately 12 veh/km/lane, while the traffic densities of Lanes #2, #3 and #4 are underestimated with MAEs of approximately 9, 7, and 6 veh/km/lane, respectively. The possible causes of the underestimation and overestimation are two-fold:
(1) The frame rate is not sufficient to support an individual estimation for each lane. Since the image is updated once every two minutes, an average of 7.5 images is accumulated in a time interval of 15 min. With such a small sample, the estimate may be biased because the images happen to capture non-recurrent patterns of the traffic density.
(2) The determination of the lane of each vehicle may be biased, as the lane occupied by each vehicle is determined by the center of the bounding box. When the road is curved in the images and the vehicle is large, the center of the bounding box may shift to another lane, affecting the accuracy of the estimations in both lanes.
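The second cause can be illustrated with a small sketch of center-based lane assignment. It makes simplifying assumptions that are not taken from the paper: lanes are straight, box coordinates are already expressed along the cross-road x-axis in meters, and the lane boundaries `lane_edges` are hypothetical values. The helper names `box_center` and `assign_lane` are likewise introduced only for this example.

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple


def box_center(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Center of an (x_min, y_min, x_max, y_max) bounding box."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def assign_lane(center_x: float, lane_edges: Sequence[float]) -> Optional[int]:
    """Return the 1-based lane index containing center_x, or None if outside.

    lane_edges holds the sorted x positions of the lane boundaries, so
    n lanes are described by n + 1 edges.
    """
    if center_x < lane_edges[0] or center_x >= lane_edges[-1]:
        return None
    return bisect_right(lane_edges, center_x)


# Hypothetical four-lane road with 3.5 m lanes.
lane_edges = [0.0, 3.5, 7.0, 10.5, 14.0]

# A car whose box lies fully inside Lane #1.
print(assign_lane(box_center((0.8, 0.0, 2.6, 4.5))[0], lane_edges))

# A large vehicle driving in Lane #1 whose box spills into Lane #2:
# the box center crosses the boundary and the vehicle is counted in Lane #2.
print(assign_lane(box_center((1.0, 0.0, 6.5, 12.0))[0], lane_edges))
```

In the second case one lane loses a vehicle and its neighbor gains one, which matches the paired over- and underestimation described in cause (2).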

Case study II: surveillance cameras in Sacramento
To demonstrate the generality of the proposed framework, another case study is conducted using a surveillance camera in the Caltrans system. The monitoring video data is collected from the camera on the Capital City Freeway, Sacramento, CA (shown in Fig. 12 left).

A 24-hour video is downloaded, covering the period from 12:00 AM on February 17th to 12:00 AM on February 18th, 2022. Similar to the procedures for HK, key points on vehicles are annotated manually for camera calibration. The ground-truth density data are obtained from a double-loop detector at the same location (shown in Fig. 12 center) within the same time period. The detector data are obtained from the PeMS system, which includes the average traffic speed, density, and flow data. Given the study region, we divide the road into three lanes (numbered along the x-axis), and define vehicle locations along the y-axis, as shown in Fig. 12 right.

Fig. 13 The variation of traffic density in 24 h in Sac

The accuracy of the estimated traffic density is shown in Fig. 13. The blue curve represents the estimated density, while the orange curve represents the ground truth. The gaps between these curves are small from visual inspection, and the proposed framework can successfully detect sudden changes in traffic density. Furthermore, the estimation accuracy does not deteriorate at nighttime. In this scenario, sunset was at about 6:00 PM, but the proposed framework can follow the evening peak even after 6:00 PM. To evaluate the estimation accuracy during transitions from free-flow to congestion or from congestion to free-flow regimes, we divide the 24 h into two density regimes, transition time and non-transition time. The transition time is from 11:00 AM to 1:00 PM and from 5:00 PM to 7:00 PM. The non-transition time consists of the remaining time intervals. From Table 9, the average MAPEs in transition time and non-transition time are 20.63% and 18.31%, respectively. Though the MAPE in transition time is about 2 percentage points higher than that in non-transition time, the MAPEs remain at a similar level, meaning that the method can capture the transition in traffic density.
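The regime-wise comparison above can be reproduced by splitting the density samples by time of day before computing the MAPE. In the sketch below, the transition windows follow the text (11:00 AM to 1:00 PM and 5:00 PM to 7:00 PM), while the sample format, the helper names and the treatment of the window end as exclusive are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple

# Transition windows from the text, as [start_hour, end_hour) on a 24 h clock.
TRANSITION_WINDOWS = ((11, 13), (17, 19))


def is_transition(hour: float) -> bool:
    """True if the hour of day falls in one of the transition windows."""
    return any(start <= hour < end for start, end in TRANSITION_WINDOWS)


def mape(pairs: List[Tuple[float, float]]) -> float:
    """MAPE in percent over (estimated, true) pairs; true values must be non-zero."""
    return 100.0 * sum(abs(est - true) / true for est, true in pairs) / len(pairs)


def mape_by_regime(samples: Iterable[Tuple[float, float, float]]) -> Dict[str, float]:
    """samples: (hour_of_day, estimated_density, true_density) tuples."""
    groups: Dict[str, List[Tuple[float, float]]] = {"transition": [], "non-transition": []}
    for hour, est, true in samples:
        key = "transition" if is_transition(hour) else "non-transition"
        groups[key].append((est, true))
    # Skip regimes without samples to avoid dividing by zero.
    return {name: mape(pairs) for name, pairs in groups.items() if pairs}


# Hypothetical samples (illustrative values only).
samples = [(3.0, 9.0, 10.0), (12.0, 22.0, 20.0), (18.5, 33.0, 30.0), (21.0, 14.0, 16.0)]
print(mape_by_regime(samples))
```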

Conclusions
In this paper, we propose a framework for traffic density estimation using traffic surveillance cameras with 4L characteristics, where 4L stands for Low frame rate, Low resolution, Lack of annotated data, and Located in complex road environments. The proposed density estimation framework consists of two major components: camera calibration and vehicle detection. For camera calibration, a multi-vehicle calibration method named MVCalib is developed to estimate the actual length of roads from camera images. For vehicle detection, the transfer learning scheme is adopted to fine-tune the deep-learning-based model parameters. A linear-program-based data mixing strategy that incorporates multiple datasets is proposed to synergize the performance of the vehicle detection model in different traffic scenarios.
The developed camera calibration and vehicle detection models are compared with existing baseline models in terms of the performance on real-world surveillance camera data in Hong Kong and Sacramento, and both models outperform the existing state-of-the-art models. The MAE of camera calibration is less than 0.4 m out of 6 m, and the accuracy of the detection model is approximately 90%. We further conduct two case studies in Hong Kong and Sacramento to evaluate the quality of the estimated density. The experimental results indicate that the MAE for the estimated density is 9.04 veh/km/lane in Hong Kong and 7.03 veh/km/lane in Sacramento. Comparing the estimation results in the two study regions, we also observe that the performance of the proposed density estimation framework degrades under low-quality images and high-illumination-intensity environments. Considering the robustness of surveillance cameras and the estimation accuracy, we think the performance is acceptable for current transport industries. The proposed framework has great potential for large-scale traffic density estimation from surveillance cameras in cities across the globe, and it could provide considerable and fruitful information for traffic operations and management applications.
In future research, we would like to extend the proposed framework to estimate other traffic state variables such as speed, flow, and occupancy. In the camera calibration method, the key points of each vehicle are manually labeled, which can be further automated [30]. In addition to the vehicle detection model, a vehicle classification model could also be developed to estimate traffic density by vehicle type. Moreover, it would be of practical value to develop a fully automated and end-to-end pipeline to deploy the proposed density estimation framework in different traffic surveillance systems.
Fig. 14 The calibration losses in the vehicle model matching stage for the camera in HK. The x-axis is the iteration of optimization, while the y-axis is the loss

Fig. 15 The calibration losses for the camera in the parameter fine-tuning stage in HK. The x-axis is the iteration of optimization, while the y-axis is the loss

formulated, respectively, and there is no guarantee of finding the global optimum. Since the gradients of variables are difficult to formulate in the optimization problems of the vehicle model matching and parameter fine-tuning stages, we adopt a non-gradient-based optimization algorithm, CMA-ES, to solve these problems. The calibration losses for the camera in HK in the vehicle model matching and parameter fine-tuning stages are shown in Figs. 14 and 15. It is clear that most losses with different combinations of vehicles and vehicle models converge in the vehicle model matching stage, and all losses converge to a local minimum in the parameter fine-tuning stage. The patterns of calibration losses in Sac for both the vehicle model matching and parameter fine-tuning stages are similar to those in HK, so we do not discuss them separately.

B.2 Sensitivity
The robustness of the MVCalib algorithm is important in real-world applications. In this section, we discuss the impact of initial parameters on the MVCalib algorithm. The MVCalib algorithm consists of three calibration stages. In the candidate generation stage, the problem is not solved by an optimization algorithm. Hence there is no initial parameter in this stage. In the vehicle model matching and parameter fine-tuning stages, the problems are solved by two optimization algorithms. The initial parameters may individually or mutually affect the calibration results. Hence, we set up three groups of experiments to determine the influence of the initial parameters.
In the first group, we only change the initial parameters in the vehicle model matching stage, within scales of 10%, 20% and 30%. We repeat the process 5 times. In the second group, we only change the initial parameters in the parameter fine-tuning stage within the same scales 5 times. In the third group, the initial parameters in both the vehicle model matching and parameter fine-tuning stages are changed 5 times within the same scales. Table 11 shows the average results within the same scale in each group. It can be seen that, when the change scale is no more than 20%, the calibration results do not degrade significantly in most groups in HK and Sac. When the change scale reaches 30%, RMSE, MAE and MAPE are nearly doubled in most groups in HK and Sac, meaning that the camera calibration results are not reliable. Therefore, the MVCalib algorithm can tolerate perturbations of the initial parameters within 20% of the original scale.

Before introducing the concept of the above metrics, there is a prerequisite metric called intersection over union (IoU), which measures the gap between the estimated object location and the ground truth. The outputs of the detection model are twofold. One is the four corner coordinates that locate the object in the image. The other is the confidence probability of the category to which the object belongs. If we overlap the estimated and the ground-truth bounding boxes, there will be an area of intersection (shown in Fig. 17a) and an area of union (shown in Fig. 17b), where the red and green rectangles are the estimated and ground-truth bounding boxes of an object, and the blue region shows the intersection and union area, respectively. The intersection over union is defined as the quotient of the intersection area over the union area.
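The IoU definition above translates into a few lines of code for axis-aligned boxes. This is a minimal sketch that assumes boxes are given as (x_min, y_min, x_max, y_max) tuples; the function name `iou` is chosen for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(box_a: Box, box_b: Box) -> float:
    """Intersection over union of two axis-aligned bounding boxes."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Two 2 x 2 boxes sharing a 1 x 1 corner: intersection 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```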
A threshold for IoU is set to decide whether a predicted bounding box is correct. If the IoU exceeds the threshold, we label the prediction as a True Positive (TP). More generally, all circumstances fall into three categories: True Positive (TP), False Positive (FP), and False Negative (FN). These circumstances are illustrated in Table 12.
True Positive (TP): The IoU between the predicted and ground-truth bounding boxes exceeds the threshold.
False Positive (FP): 1. The IoU between the predicted and ground-truth bounding boxes is smaller than the threshold. 2. The estimated bounding box does not overlap with any ground-truth bounding box.
False Negative (FN): The object is not detected by the algorithm.

Additionally, the precision and recall can be calculated as

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$

Precision and recall are a pair of conflicting metrics: when the precision is high, the recall is relatively low, and vice versa. If we rank all the detection results according to the confidence probability, set different thresholds for the confidence probability and re-calculate the precision and recall, a precision-recall (PR) curve can be plotted where the x-axis is the recall and the y-axis is the precision. The PR curves of the different vehicle detection models are shown in Fig. 18.
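Given the counts of TP, FP and FN at one confidence threshold, precision and recall follow from their standard definitions. The sketch below is illustrative; the function name and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


# Hypothetical counts: 8 correct detections, 2 spurious boxes, 2 missed vehicles.
print(precision_recall(8, 2, 2))
```

Repeating this computation while sweeping the confidence threshold yields the points of the PR curve.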
As the confidence threshold increases, the recall decreases while the precision generally increases. If the curve is close to the upper right corner of the figure, the performance of the model is good. Hence, it can be seen that the detection models trained with the Spaghetti and LP hybrid datasets outperform the models trained with a single dataset. In particular, if we compare the curves of the models trained with the Spaghetti and LP hybrid datasets, the differences between these two models are marginal.

Fig. 18 The PR curve for camera images on the testing set. (c) The PR curve for images throughout day and night.
The Average Precision (AP) is the area below the PR curve, calculated as

$$\mathrm{AP} = \int_0^1 PR(r)\, dr$$

where $r$ is the recall and $PR(r)$ is the precision at recall $r$. The mAP@0.5 means the AP value when the IoU threshold is 0.5. Besides, the mAP@0.5:0.95 means the average of the AP values when the IoU threshold equals 0.5, 0.55, 0.6, ..., 0.9, 0.95, respectively. These two metrics are extensively used to evaluate the performance of algorithms in object detection tasks in CV.
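As a rough numerical counterpart to the AP definition, the area under a sampled PR curve can be approximated with the trapezoidal rule. This is a simplified sketch: detection toolkits usually apply an interpolated precision envelope before integrating, which is not reproduced here, and the function name and sample points are hypothetical.

```python
from typing import List, Sequence


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Approximate the area under the PR curve with the trapezoidal rule."""
    points = sorted(zip(recalls, precisions))  # integrate along increasing recall
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# IoU thresholds averaged over for mAP@0.5:0.95: 0.5, 0.55, ..., 0.95.
iou_thresholds: List[float] = [round(0.5 + 0.05 * i, 2) for i in range(10)]

# Hypothetical PR samples: precision stays at 1.0 up to recall 0.5, then drops to 0.
print(average_precision([0.0, 0.5, 1.0], [1.0, 1.0, 0.0]))
print(iou_thresholds)
```

Averaging the AP values obtained at each threshold in `iou_thresholds` gives the mAP@0.5:0.95 figure.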

Fig. 2 Framework of traffic density estimation with surveillance cameras

• It systematically designs a linear-program-based data mixing strategy to synergize image datasets from different cameras and to balance the performance of the deep-learning-based vehicle detection models under different traffic scenarios.
• It validates the proposed framework in two traffic surveillance camera systems in Hong Kong and Sacramento, and the research outcomes create portals for rapid and massive deployment of the proposed framework in different cities.

the endogenous camera parameters, where $f$ denotes the focal length of the camera, and $w$ and $h$ represent the width and height of images. $R = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix}$ and $T \in \mathbb{R}^{3}$ are the rotation matrix and translation vector of the camera, respectively. Hence, $[f, R, T] \in \mathbb{R}^{13}$ are the 13 parameters to be estimated in the problem of camera calibration.

Fig. 3 The pipeline of the MVCalib method for camera calibration

Fig. 4 A glance of various traffic-related image datasets used in this study

Fig. 7 Driving lanes from real world and camera images in HK and Sac

Fig. Driving lanes for traffic density estimation beneath the surveillance camera

Fig. 10 Overview of the vehicle detection results in HK

Fig. 16 The calibration losses with different τ in HK and Sac

Fig. 17 Illustration of the intersection over union (IoU). (a) Area of intersection. (b) Area of union.

Fig. 18 (a) The PR curve for images at daytime. (b) The PR curve for images at nighttime.

Table 1 A review of emerging sensing technologies for estimating traffic density

Table 3 The comparison for the selected traffic cameras used for case studies in Hong Kong and Sacramento

to inherit the pre-trained weights and tune the weight parameters on the LP hybrid dataset. The YOLO-v5 network is a general framework for detecting and classifying objects simultaneously. In the vehicle detection context, we only need to box out the vehicles from the background images regardless of vehicle type. Hence we reshape the output dimension into one with randomly initialized parameters. As the LP hybrid dataset contains camera images in various traffic scenarios, we can build a generalized detection model suitable for various traffic surveillance systems in different countries.

Table 4 The comparison of results of surveillance camera calibrated by different methods in HK and Sac (unit for RMSE and MAE: meter)

Snapshots of calibration results in HK and Sac

Table 6 Evaluation results for different detection models on images during the daytime

Table 7 Evaluation results for different detection models on images during the nighttime

Table 8 The RMSE, MAE and MAPE of the estimated density for different lanes from surveillance cameras in HK (unit for RMSE and MAE: veh/km/lane)

Table 9 The RMSE, MAE and MAPE of the estimated density in different lanes from surveillance cameras in transition and non-transition time in Sac (unit for RMSE and MAE: veh/km/lane)

Table 11 The sensitivity of initial parameters in HK and

Table 12 Illustration of TP, FP, and FN