Introduction

Challenges in Applying Deep Learning Methods

Artificial intelligence (AI) systems are characterised by their capability to acquire their own knowledge by extracting patterns from raw data (Goodfellow et al. 2016). The two key components of AI systems are (1) data from sensors, and (2) algorithms/models that translate the data into useful information. Considering these components, we could say that transportation researchers and practitioners have long been working on AI applications. For example, in many cities in developed countries, loop detectors have been installed and traffic signals are controlled automatically following certain models and algorithms (Zhao et al. 2011). Loop detector and global positioning system (GPS) data have also been utilised to provide route guidance information, together with algorithms that translate raw data into travel-related information. Thus, AI applications are not new in the transportation field. However, the landscape is going to change drastically, mainly because of (1) a dramatic increase in data streams from various sensors, and (2) the rapid development of machine learning techniques, including deep learning.

Conventional machine learning techniques such as the random forest method, support vector machines (SVM), and shallow neural network methods have been commonly used in the field of transportation. They have been applied to traffic state prediction, such as traffic speed and flow (Do et al. 2019; Wang and Shi 2013), travel time prediction (Wu et al. 2004), bus arrival time prediction (Bin et al. 2006), and transportation mode extraction (Shafique and Hato 2015), among other applications. The last decade has witnessed a rise in the availability of big data in the field of transportation. Data from various sources such as GPS, loop detectors, and closed circuit television (CCTV) cameras are being increasingly used in transportation analysis. However, analysing data pertaining to transportation involves the disentanglement of multiple factors of variation. For example, for the prediction of driving behaviour using CCTV or other camera footage, the factors of variation might include several high-level, abstract features such as the age and gender of the driver, the various sitting positions they can drive in, and the different types of activities they can perform while driving. Achieving high accuracy in such complex situations is a major challenge, and deep learning methods are becoming more popular in such scenarios.

Goodfellow et al. (2016) argued that in cases where a nearly human-level understanding of the data is required, representation becomes a major challenge. Deep learning models provide an excellent way to solve this problem by building complex representations based on a combination of simpler representations. Multiple layers can be added to represent complex and abstract features (LeCun et al. 2015), thereby improving the overall accuracy levels. However, with respect to deep learning’s use in transportation, there remain at least three challenges for researchers and policy-makers. First, the black box functionality of deep learning models is a major barrier to linking these models with existing transport theories. Policy-makers from across the world have debated the logic and the reasoning behind using deep learning models (European Parliamentary Research Service 2019; Ministry of Internal Affairs and Communications 2019). Improving accuracy cannot be the lone goal of such models and, in principle, these models should be rooted in existing theories. For example, Chikaraishi et al. (2020) argued that, in short-term traffic state prediction, a machine learning model which produces the best prediction accuracy is not always the best for practical use since it does not mimic the mechanisms of congestion occurrence. Second, these methods are cost and resource intensive, and are influenced by external factors such as type of data and sample size. Traffic forecasting studies, for example, incur a huge cost in collecting observed data, as a practical application often needs the collection of city-scale data. Unlike other practical applications of AI, where data could be obtained from a laboratory or a plant, transportation studies offer a unique challenge with respect to controlling external factors. Third, the choice of appropriate methods often requires modellers to adopt a trial-and-error approach. 
The choice of the type of neural network structure would influence the prediction accuracy, which would further depend on the variable being predicted. With multiple areas of application in transportation, this could get very challenging.

This study intends to contribute to the second and third challenges through a review of the relationship between external factors and the prediction accuracy of deep learning models. A proper summary of existing findings would provide a good guide to reduce the burden of the trial-and-error model tuning process, while accounting for other factors such as type of data and sample size.

A Brief Overview of Existing Studies

Currently, both computer scientists and transportation engineering professionals have applied deep learning methods to predict complex transportation phenomena, and their use in transportation studies is rapidly increasing. Initial studies mostly focused on image detection pertaining to the detection of traffic signs (Cireşan et al. 2012), vehicles (Chen et al. 2014), and pedestrians (Ouyang and Wang 2013). However, lately deep learning methods have been applied in many studies analysing complex variables, such as traffic state prediction (Bai and Chen 2019; Jo et al. 2019), travel demand estimation (Tang et al. 2019a, b), and mode choice and activity choice prediction (Zhao et al. 2019a, b, c, d, e, f).

LeCun et al. (2015) categorised the deep learning methods into three major groups: (a) multilayer architecture using backpropagation (Bengio et al. 2007), (b) convolutional neural networks (CNN) (Simonyan and Zisserman 2014), and (c) recurrent neural networks (RNN) (Graves et al. 2013). The type of method used depends on the field of application. For example, CNNs have been successful in the analysis of image data and the recognition of its features. Meanwhile, RNN methods have been useful in text and word recognition and in data which require processing of values in sequences (LeCun et al. 2015). Moreover, there exist significant variations within a type of method; for example, the classical RNN structures have evolved into gated RNNs such as the long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) and the gated recurrent unit (GRU) models (Chung et al. 2014). There are various travel-related variables that could be predicted using deep learning models, and the choice of method would significantly influence the prediction accuracy.

Review studies on the application of deep learning in transportation have focused on identifying the areas of application (Nguyen et al. 2018; Wang et al. 2019) and the different kinds of methods (Do et al. 2019). Wang et al. (2019) identified that deep learning models in transportation have been applied mostly for either classification of discrete states or regression of continuous real values. They identified the areas of application, while describing the advantages and disadvantages of the methods. However, the choice of method and its accuracy would depend upon other external factors such as sample size, area of application, region of data collection (whether from an urban area, a rural area, or both), and time horizon of prediction (for example, short-term or long-term prediction). Previous review studies did not examine how accuracy relates to the area of application, the type of deep learning method, and other external factors such as type and source of data, data coverage, sample size, and time horizon of prediction.

Objectives

To fill the above-mentioned research gaps, this paper aims (a) to identify the set of deep learning methods used with respect to the area of application, type of data collected, coverage of the study, time horizon of prediction, and sample size; and (b) to statistically determine the relationship of these external factors with prediction accuracy through a meta-analysis. We believe that this review paper will contribute to the existing literature on the application of AI in transportation in two major ways. First, as deep learning is a relatively new field and the number of papers being published has been increasing every year, an extensive survey will cover new studies which were not reviewed in earlier similar works (Do et al. 2019; Nguyen et al. 2018; Wang et al. 2019). Second, it will add to the understanding of how various factors pertaining to the methodology, data type, coverage area, time of prediction, or sample size are associated with the accuracy. This knowledge will be beneficial for researchers and transportation practitioners, who will be able to control for these factors in future studies.

This article is organised as follows. Section 2 describes the research methodology in two sub-sections. The first sub-section details the study selection and search strategy employed to select the papers that were reviewed, and the second discusses the approach adopted for the meta-analysis. Section 3 discusses the descriptive statistics of the review analysis, which is followed by the discussion of the results of the meta-analysis in Sect. 4. Finally, we conclude our work in Sect. 5 by discussing the key findings, contributions, limitations, and the directions for future research in the field of AI and transportation.

Research Methodology

This section describes the investigation methodology adopted for this study and is further divided into two sub-sections. Section 2.1 illustrates the search strategy implemented to select the papers which used deep learning in transportation analysis. Meanwhile, Sect. 2.2 describes the method and the set of variables considered in the meta-analysis.

Search Strategy

The literature for the review was selected in two steps. First, a generic search on the Web of Science (WoS) database helped in identifying three previously published review papers on deep learning studies in the field of transportation. The reference sections of these review papers were searched to identify the relevant literature. As a result, a total of 72 unique studies were identified (see Table 1). A conscious decision was made to limit our review to only those studies which analysed travel-related or travel behaviour variables. Therefore, studies which employed deep learning models for the detection of vehicles, traffic signs, pedestrians, and cracks in pavement were not included in this review. These areas of application mostly used widely available image-based datasets and typically employed different variants of the CNN model (Li et al. 2016; Luo et al. 2014; Qian et al. 2015). The three review studies, viz. Wang et al. (2019), Do et al. (2019) and Nguyen et al. (2018), succinctly summarised the work done so far based on the areas of application, which informed the second step of the search strategy (Wee and Banister 2016). Relevant keywords were identified and systematically searched using the WoS database. The list of keywords is provided in Table 1. Four separate indexes were searched: (1) Science Citation Index Expanded (SCI-EXPANDED), (2) Social Sciences Citation Index (SSCI), (3) Arts & Humanities Citation Index (A&HCI), and (4) Emerging Sources Citation Index (ESCI). In addition, only articles published in English were considered for analysis. Table 1 lists the resulting number of papers from the searches after excluding the common papers identified in step 1. A total of 106 additional papers were identified using the keyword searches, bringing the total number of papers to be reviewed to 198. 
However, as the major objective of this review is to conduct a meta-analysis testing the effect of different variables on accuracy levels, studies which did not report any indicator measuring accuracy were not considered for the review. Different studies reported different indicators of the accuracy of their methods. Many studies directly reported accuracy percentages (Gu et al. 2019a, b), while others reported the error percentage in the form of mean absolute percentage error (MAPE) (Bao et al. 2019a), mean relative error (MRE) (Zhao et al. 2017), or root mean square error % (RMSE %) (Jo et al. 2019). The error values were then used to calculate the accuracy levels (100 − error%). In addition, some studies reported recall rate (Zhu et al. 2019), R² (Polson and Sokolov 2017), and area under curve (AUC) values (Singh and Mohan 2018), which also indicated the accuracy levels. Studies which did not report any of the above-mentioned indicators (see Table 1) were omitted from the list of studies to be reviewed. Studies which utilised simulated data for their analysis (Gang et al. 2015) were also removed from the final list. The final tally of papers considered and reviewed for the meta-analysis was 136. The next section describes the review process and the methodology adopted for the meta-analysis.
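The conversion of reported indicators into accuracy levels can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used in the review: the function name `to_accuracy_percent`, the indicator labels, and the choice to rescale proportion-type scores (recall, R², AUC) by 100 are all assumptions made for the example.

```python
# Hypothetical helper (not from the paper): maps a reported indicator value
# onto a 0-100 accuracy scale.
PERCENT_ERRORS = {"MAPE", "RMSE%"}            # errors reported in percent
PROPORTION_ERRORS = {"MRE"}                   # errors reported as a proportion
PERCENT_SCORES = {"accuracy%"}                # already an accuracy percentage
PROPORTION_SCORES = {"recall", "R2", "AUC"}   # assumed here to lie in [0, 1]


def to_accuracy_percent(indicator, value):
    """Return an accuracy percentage for one reported indicator value."""
    if indicator in PERCENT_ERRORS:
        return 100.0 - value                  # accuracy = 100 - error%
    if indicator in PROPORTION_ERRORS:
        return 100.0 * (1.0 - value)          # accuracy = 100 * (1 - error)
    if indicator in PERCENT_SCORES:
        return float(value)
    if indicator in PROPORTION_SCORES:
        return 100.0 * value                  # assumption: simple rescaling
    raise ValueError(f"unsupported indicator: {indicator}")


# Illustrative values only, not taken from any reviewed study.
example = to_accuracy_percent("MAPE", 12.5)
```

A study reporting an unsupported indicator raises an error in this sketch, which mirrors the exclusion rule described above.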

Table 1 Search strategy for literature review

Meta-analysis

To conduct the meta-analysis, information under nine separate heads was extracted from the papers short-listed for the review (N = 136). Table 2 lists the factors extracted. They were (1) the year of publication, (2) the country where the study was performed or to which the data belonged, (3) the region from which the data were collected, i.e. whether an urban area, a rural area, or both, (4) the source of data, (5) the type and format of data used for the analysis, (6) the total number of samples considered for the study, (7) the time horizon of prediction, (8) the method used for analysis, and finally (9) the prediction accuracy of the method. The information was collated for all 136 studies, which resulted in 2314 unique rows of information, each row denoting a method analysed for a particular area of application.

Table 2 Factors extracted from the literature review and the variables used for meta-analysis

The meta-analysis on the prediction accuracy was carried out using linear mixed effects models (Laird and Ware 1982), which accounted for the random effects representing unobserved heterogeneity in prediction accuracy across studies. Introducing the random effects is essential for the meta-analysis since the accuracy level would depend not only on variables introduced, but also on other unobserved variables such as quality of sensors and data cleaning process. Percentage accuracy was considered to be the dependent variable. Meanwhile, fixed effects were estimated for different variables denoting the type of method used, area of application, source of data, region of data collection, sample size, and time horizon of prediction.
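One common way to write such a model is the random-intercept form below. It is offered only as a sketch, assuming a single random intercept per study; the symbols are introduced here for illustration and the exact specification estimated in this review may differ.

\[
y_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_{j} + \varepsilon_{ij}, \qquad u_{j} \sim N(0, \sigma_{u}^{2}), \qquad \varepsilon_{ij} \sim N(0, \sigma^{2})
\]

Here \(y_{ij}\) is the percentage accuracy of case \(i\) reported in study \(j\), \(\mathbf{x}_{ij}\) collects the fixed-effect variables (type of method, area of application, source of data, and so on), \(\boldsymbol{\beta}\) holds the fixed effects, \(u_{j}\) is the study-level random effect capturing unobserved heterogeneity, and \(\varepsilon_{ij}\) is the residual.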

Four separate models were developed, as the information on sample size and time of prediction was not available for all studies and because different studies used different indicators to measure the prediction accuracy. Models A to C, which did not consider sample size and time of prediction as explanatory variables, covered 136, 86, and 29 studies, respectively (N = 2314, 1878, and 220, respectively) and tested the effect of the other variables listed in Table 2. Model D was developed exclusively for studies which analysed traffic-related applications such as flow and speed and which used the MAPE, MRE, and average accuracy indicators to estimate prediction accuracy (N = 991, studies = 36). This model also tested the impact of sample size and time horizon of prediction on the prediction accuracy. The linear mixed effects models were developed in R using the lme4 package (Bates et al. 2014). The results of the meta-analysis are discussed in Sect. 4.

Findings from the Literature Review

As mentioned in the previous section, the information from the papers was extracted and collated under different heads, and this section describes those factors. It aims to provide a detailed overview of the studies, the types of methods, their coverage, and the areas of application. In addition, this section discusses the average accuracy observed across different methods and areas of application.

Areas of Application

The review of the literature showed that deep learning methods were used in ten distinct transportation-related fields. The most common areas of application belonged to the field of traffic forecasting. Traffic flow forecasting was analysed in 42 studies, and 26 studies used deep learning methods to predict traffic speed (see Table 3). In addition, one study predicted road occupancy levels (Zang et al. 2017a, b). Travel demand prediction was also popular, with 19 studies estimating the travel demand for different travel modes such as bus (Baek and Sohn 2016), bus rapid transit systems (BRT) (Liu and Chen 2017a), car sharing (Zhu et al. 2017), mass rapid transit systems (MRT) (Liu and Chen 2017b), taxis (Xu et al. 2018a, b, c, d), trains (Tang et al. 2019a, b), parking (Yang et al. 2019), and bike sharing (Xu et al. 2018a, b, c, d). Moreover, deep learning methods were also used to estimate the travel demand between origins and destinations (Cheng et al. 2017). Prediction of congestion was analysed in 11 different studies. Traffic accidents were predicted in 12 studies, of which only one focused on railway accidents (Feng et al. 2018), while the others analysed road traffic accidents. Driver behaviour, which included the prediction of distracted driving (Eraqi et al. 2019), subjective risk perception while driving (Ping et al. 2018), lane changing behaviour (Dou et al. 2018), and braking behaviour (Christopoulos et al. 2018), was predicted in 17 studies. Travel behaviour, such as mode choice and activity classification, was predicted using deep learning methods in five (5) studies: mode choice was predicted in three studies, activity state classification was analysed in one study (Cui et al. 2018a, b, c), and one study estimated both mode and activity classification together (Zhao et al. 2019a, b, c, d, e, f). 
Travel time was predicted in seven (7) studies, which also included one study which predicted both travel time and travel distance together (Jindal et al. 2017) (see Table 3).

Table 3 Area of application and year of publication

Year-Wise Distribution of Studies

The review of the literature clearly showed a steady increase in the use of deep learning methods in the field of transportation. Table 3 shows how the use has increased from merely two (2) studies in 2014 to 59 studies in 2019. Comparison of publications based on the area of application shows a rising trend in the fields of accident analysis, driver behaviour prediction, travel time prediction, and traffic state prediction.

Types of Data Source

Loop detectors were observed to be the most commonly used data source, with 55 studies utilising the data obtained from them. Loop detectors are traffic sensors which are usually installed on roads or at toll-stations to detect vehicles. Using the data, one can estimate the flow, speed, and occupancy of the road segments. The frequency at which a typical detector records and relays information varies. Some studies use information recorded every 30 s, while others use 1-min or 5-min interval records. Researchers often aggregate the data to predict the traffic state in the short term, and 15-min aggregation was observed to be common among different studies. Data extracted using GPS was observed to be the second most common source, with 38 studies using it. Data from mobile phones and vehicle-based intelligent transport systems (ITS) both provided GPS information. The information was often in terms of trips made and their trajectories (Bao et al. 2019a; Ma et al. 2015a, b). The same data source was also used to extract information related to speed (Zhao et al. 2019a, b, c, d, e, f), travel time (Petersen et al. 2019), activity state information (Cui et al. 2018a, b, c), and congestion (Chen et al. 2016a, b). GPS is technically an external source of information, whereas mobile phone sensors such as the accelerometer, magnetic, gyroscope, and barometer (AMGB) sensors can be classified as internal sources. Five studies could be identified which used data from AMGB sensors. These data have been utilised in the areas of mode choice prediction (Qin et al. 2018), braking behaviour (Christopoulos et al. 2018), and congestion analysis (Tu et al. 2017). Image-based data collected from CCTV footage or other cameras (including LIDAR sensors) were also common, being used in 22 different studies. Meanwhile, data from mobile phone-based applications and platforms were used in 10 studies. The data from this source were mostly used to predict travel demand (Ke et al. 2017a, b) and mode and activity states (Zhao et al. 
2019a, b, c, d, e, f). Nine studies also used external accident data. Meanwhile, eight (8) studies used the information collected by automated fare collection (AFC) devices fitted in public transport systems (Liu et al. 2019) or parking stations (Yang et al. 2019) (see Table 4). The data collected from AFC devices was used for predicting travel demand in bus and MRT systems (Baek and Sohn 2016; Liu and Chen 2017b). In addition, it was observed that nine studies used certain other sources of data which involved household surveys (Cui et al. 2018a, b, c), toll-based tag data (He et al. 2019), and car based ITS (Jo et al. 2019). Finally, it was observed that in addition to a primary source of data, 40 studies also utilised secondary information such as road network attributes (Zhu et al. 2019), weather information (Xu et al. 2018a, b, c, d), population data (Bao et al. 2019a), and land use data (Baek and Sohn 2016) for the training of their models. It should be noted that many studies utilised more than one data source for their analysis and therefore, Table 4 has multiple entries for a single study.
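The aggregation of detector records into longer intervals, mentioned above for loop detector data, can be illustrated with a short Python sketch. It assumes 5-min vehicle counts that are summed into 15-min totals; the function name `aggregate_counts` and the sample values are hypothetical and are not taken from any reviewed study.

```python
def aggregate_counts(counts, group_size):
    """Sum consecutive groups of `group_size` interval counts.

    A trailing incomplete group is dropped so that every output value
    covers the same length of time.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    usable = len(counts) - len(counts) % group_size
    return [sum(counts[i:i + group_size]) for i in range(0, usable, group_size)]


# Illustrative 5-min vehicle counts (made-up values).
five_min_counts = [12, 15, 11, 20, 18, 22, 9]

# Three 5-min records make one 15-min record; the last value is left over.
fifteen_min_counts = aggregate_counts(five_min_counts, 3)
```

Counts are summed in this sketch; a quantity such as speed would more naturally be averaged over each group instead.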

Table 4 Type of data sources used in the studies

Coverage and Region of Studies

The distribution of the studies by country of coverage showed that a very high number of studies came from just two countries (see Table 5): China (54) and the United States of America (USA) (45). A possible reason for this might be the readily available data in these countries. In addition, 11 studies were based in the United Kingdom, and four (4) studies utilised data from South Korea. The Netherlands, Japan, and India showed two studies each, whereas Australia, Canada, Denmark, Egypt, Germany, Greece, Hong Kong, Malaysia, Morocco, Norway, Palestine, Poland, Taiwan, and Uganda each showed one study based on their data. It should be noted that some studies utilised data from more than one country (Eraqi et al. 2019; Qin et al. 2018) and ten studies did not mention the country in which the data were based (Tran et al. 2018).

Table 5 Country and region coverage of the studies

The information on the region from which the data were collected showed that 75 studies were based on data specifically collected from urban areas. Meanwhile, 48 studies collected data from both urban and rural areas. In cases of loop detector data, unless it was specifically mentioned that the data were only collected from a city or an urban area, it was assumed that the data included information from both urban and rural areas. Moreover, in 13 studies there was no specific mention of the region from which the data were collected.

Accuracy Indicators

The analysis of the studies showed that 13 different types of indicators were used across the 136 papers. A majority of them used either the mean absolute percentage error (MAPE) (58 studies) or the mean relative error (MRE) (25 studies). MAPE and MRE essentially have the same formulation; the difference is that MAPE is expressed as a percentage, whereas MRE is expressed as a proportion (see Table 6). Two of the 58 studies which used MAPE did not utilise all observations to calculate it. Ke et al. (2017a, b), in their study on estimating travel demand in taxi services, calculated MAPE only for values which had a demand intensity greater than 10; these samples represented the top 4.45% of all samples. In addition, Bao et al. (2019b) utilised the top 5% of the samples with the highest values to estimate MAPE. A total of two studies used a symmetric MAPE (sMAPE) indicator to calculate error values. For example, in Xu et al. (2018a, b, c, d), the denominator (see Table 6 for the formula) contains additional terms for the predicted values and a constant (\(c\) = 1 in their study) to avoid a zero denominator. In addition, five studies used an average accuracy indicator obtained by subtracting MAPE or MRE from 100 or 1, respectively.
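The relationship between these error indicators can be sketched in Python as follows. This is a minimal illustration under stated assumptions: zero observed values are skipped when computing MRE and MAPE, and the sMAPE denominator is taken to be the sum of the observed value, the predicted value, and the constant c, which is one plausible reading of the description above and not a confirmed copy of the formula in Table 6. All function names are hypothetical.

```python
def mre(actual, predicted):
    """Mean relative error as a proportion (zero observed values skipped)."""
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero observed values")
    return sum(abs(a - p) / abs(a) for a, p in pairs) / len(pairs)


def mape(actual, predicted):
    """Mean absolute percentage error: the same quantity as MRE, in percent."""
    return 100.0 * mre(actual, predicted)


def smape_with_constant(actual, predicted, c=1.0):
    """Assumed symmetric form: mean of |a - p| / (a + p + c), in percent.

    The constant c keeps the denominator away from zero when both the
    observed and the predicted value are zero.
    """
    terms = [abs(a - p) / (a + p + c) for a, p in zip(actual, predicted)]
    return 100.0 * sum(terms) / len(terms)


# Illustrative values only.
observed = [100, 200]
forecast = [90, 220]
error_proportion = mre(observed, forecast)
error_percent = mape(observed, forecast)
```

The average accuracy indicator mentioned above then follows as `100 - error_percent` or `1 - error_proportion`.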

Table 6 Accuracy indicators used in the studies

Another commonly used indicator is a straightforward accuracy measure (29 studies), which is the ratio of the number of correct predictions to the total number of predictions (often multiplied by 100 to convert it into a percentage). This indicator has mostly been used in discrete classification studies (such as driver behaviour) as opposed to studies predicting a continuous value (like traffic flow). Conversely, one study used the error rate indicator, i.e. the ratio of the number of incorrect predictions to the total number of predictions. Duives et al. (2019) forecasted movements using GPS trajectory data, predicting the movements into the adjoining cells, and estimated the error rates for the 1st, 5th, and 20th prediction of the sequences. Other accuracy indicators in classification studies include recall rate (5 studies), precision (2 studies), and area under the receiver operating characteristics curve (AUC) (3 studies). Meanwhile, a few studies predicting continuous values used root mean square error percentage (4 studies) and R² (1 study). Since most indicators provide a ratio or proportion roughly denoting the accuracy or error, they were converted to represent an accuracy percentage for the meta-analysis [e.g. \(\mathrm{accuracy}=100\times (1-\mathrm{MRE})\)]. In addition, separate models were created with only (a) the MAPE, MRE, and average accuracy indicators and (b) the accuracy indicator. The findings of each model were then compared for further discussion.
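For the classification indicators, a small Python sketch may help to fix the definitions. It assumes a binary problem with one label treated as the positive class; the function name `classification_indicators` and the example labels are hypothetical.

```python
def classification_indicators(actual, predicted, positive=1):
    """Accuracy and error rate (in percent), plus precision and recall."""
    pairs = list(zip(actual, predicted))
    if not pairs:
        raise ValueError("no predictions given")
    correct = sum(1 for a, p in pairs if a == p)
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return {
        # correct predictions over all predictions, as a percentage
        "accuracy": 100.0 * correct / len(pairs),
        # incorrect predictions over all predictions, as a percentage
        "error_rate": 100.0 * (len(pairs) - correct) / len(pairs),
        # share of predicted positives that are truly positive
        "precision": tp / (tp + fp) if (tp + fp) else None,
        # share of true positives that were found
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# Made-up labels, e.g. 1 = distracted driving and 0 = normal driving.
truth = [1, 1, 0, 0, 1, 0, 1, 0]
guess = [1, 0, 0, 0, 1, 1, 1, 0]
scores = classification_indicators(truth, guess)
```

Accuracy and error rate always sum to 100 in this formulation, which is why one can be converted into the other for the meta-analysis.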

Deep Learning Methodologies

The review of the literature showed that the 136 studies used a total of 2314 methods to predict different transport-related variables. Each of the 2314 times a method was tested is henceforth referred to as a case in this study. The cases include the use of 11 different groups of deep learning methods along with traditional (TM) and shallow neural network (SNN) methods (see Table 7). Traditional methods in this study are classified as machine learning methods which do not use neural networks. An array of such methods was extensively tested in different studies (primarily to be compared with deep learning models). These methods included the autoregressive integrated moving average method (ARIMA), the vector autoregression method (VAR), the random forest method (RF), support vector machines (SVM), and XGBoost (XGB), among many other methods including variations of those mentioned above. As the focus of this study is primarily on understanding the use of deep learning methods, all the different traditional methods have been clubbed into one group. Such an aggregation of non-deep learning methods provides a wider scope to evaluate all the different deep learning models in the meta-analysis. However, we realise that aggregating methods with such different characteristics is a major limitation of our study. Similarly, the SNNs have also been grouped together to generate a baseline for the inference. SNNs denote all those neural networks with a shallow architecture. Table 7 shows the different types of methods based on the area of application. A total of 113 cases were estimated using deep neural networks (DNN) based on feed-forward networks. Wang et al. (2019) classified these models as deep multilayer perceptron (MLP) and discussed in detail their differences from stacked auto-encoders (SAE) and deep belief networks (DBN). SAE and DBN were used to predict 222 and 114 cases, respectively. 
Recurrent neural networks have been the most popular deep learning method of all. For the meta-analysis, we divided them into three separate groups: (a) classical recurrent neural networks (RNN), which were used for prediction in 60 cases, (b) long short-term memory (LSTM), which was used the most, in 289 cases, and finally (c) gated recurrent unit (GRU), which was used for prediction in 48 cases throughout the 136 studies. Additionally, in one case, a combination of LSTM-GRU methods was used for the prediction of traffic speed (Gu et al. 2019a, b). Convolutional neural networks (CNN) were the second most popular deep learning method, employed to analyse 278 cases. For ease of analysis, different variants of CNN methods such as Googlenet (Xing et al. 2019), Resnet (Hu et al. 2019), CNN with attention mechanism (Ran et al. 2019a, b), CNN with generative adversarial networks (GAN) (Lee et al. 2019a, b), and graph-based convolution (Zhang et al. 2019a, b, c, d), among other methods, have been clubbed together. Combined deep learning models were also prominent across the studies, mostly with a CNN model coupled with a type of RNN model. The combination of CNN and LSTM was the most common and was used for predicting 48 cases. Meanwhile, the combination of CNN and GRU models was used for prediction in 29 cases. Moreover, CNN was also coupled with other classical RNN structures and was used for prediction in 10 cases (see Table 7).

Table 7 Deep learning methods and the area of application

A similar trend was also visible in the distribution of the methods based on the area of application. For traffic flow forecasting, DBN (50) and LSTM (77) models were observed to be the most common. However, in the case of traffic speed prediction, the use of CNN (101 out of 278) and GRU models (33 out of 48) was observed to be high. The use of LSTM models was also observed to be common for traffic speed prediction (90 cases). In addition, for travel time prediction, it was observed that DBN models were the most used (44 cases). As the prediction of driver behaviour and congestion often used image-based data, the use of CNN was found to be common in these areas of application (38 and 23, respectively). Combined deep learning models were most commonly used to predict speed (46), followed by traffic flow (20), travel demand (9), driver behaviour (4), travel time (3), accidents (3), and congestion (2). Finally, Table 7 also lists the cases estimated using traditional and shallow neural network methods. Out of 2314 cases, the variables were predicted using traditional methods in 820 cases and using SNNs in 282 cases. The next sub-section describes the measures of central tendency in accuracy based on the type of method and the area of application.

Accuracy Distribution Across Different Areas of Application

Figure 1 shows the distribution of accuracy levels for each method and area of application. The distribution is represented with the help of box and whisker diagrams. The boxes represent the two middle quartiles, separated by the median value. Meanwhile, the whiskers represent the upper and lower quartiles, whereas the dots in the diagrams represent the outliers. Two images were created using different accuracy indicators. The first image (top image in Fig. 1) was created using the MAPE, MRE, and average accuracy indicators and represents the prediction of continuous values. A total of 1878 out of 2314 cases are represented in this image. Meanwhile, the second image (bottom image in Fig. 1) was created using the accuracy indicator and represents the prediction of discrete states (i.e. classification-based studies). A total of 220 out of 2314 cases are represented in that image.
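The quantities behind such a box and whisker diagram can be computed with the Python standard library, as sketched below. The rule used to flag outliers (points more than 1.5 times the interquartile range beyond the box) is a common convention assumed here for illustration; the text does not state which rule was used for Fig. 1. The function name `box_summary` and the accuracy values are hypothetical.

```python
from statistics import quantiles


def box_summary(values, whisker=1.5):
    """Quartiles and outliers for one box in a box and whisker diagram."""
    data = sorted(values)
    # Three cut points split the data into four quartile groups.
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        # assumed convention: points beyond 1.5 * IQR are drawn as dots
        "outliers": [v for v in data if v < low or v > high],
    }


# Made-up accuracy percentages for one method and one area of application.
summary = box_summary([20, 80, 85, 90, 95])
```

A wide box or many outliers in this summary corresponds to the high range of distribution discussed for some areas of application.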

Fig. 1 Accuracy box plot based on area of application and type of method

From the plots, it seems that the distribution of the accuracy levels depended on the area of application. Accident prediction featured in both images, i.e. accidents were predicted both as continuous and as discrete factors. When accidents were predicted as a continuous value, most methods showed a high range of distribution (the highest variation in accuracy levels among all areas of application), and the median values seemed to be lower than in other areas of application. However, when accidents were predicted as a discrete factor, the median value of accuracy seemed to be relatively higher. Meanwhile, accuracy levels for congestion, driver behaviour, and travel demand prediction also showed a relatively higher range of distribution than traffic state prediction variables (see Fig. 1). Traffic state prediction variables such as flow and speed showed a lower range of distribution and higher median values, with most methods ranging above the 75% accuracy level. In addition, some differences in the distribution with respect to the method applied for prediction were also observed. The next section presents the results of the meta-analysis, discussing the effects of the area of application, the type of method, and other variables on prediction accuracy.

Results

For the meta-analysis, four separate models were developed. The first model (model A) analysed all 2314 samples, testing the effects of deep learning methods, traditional methods, areas of application, type of data source, and the region of study. In addition, the model tested the random effects due to the study (N = 136). For this model, all accuracy indicators were converted into a variable with a value out of 100, denoting the level of accuracy. However, as explained earlier, these indicators have different formulations, so two additional models were developed: model B, with only the MAPE, MRE, and average accuracy indicators (as they have the same formulation), and model C, with only the accuracy indicator. In addition, as information on the sample size and the time horizon of prediction was not available for all the studies, another model (model D) was developed to test the impact of those variables. Model D only considered studies that used the MAPE, MRE, and average accuracy indicators, forecasted speed and flow, and reported the sample size and the time horizon of prediction. The time horizon of prediction indicates how far into the future (generally in minutes) a variable is predicted, e.g. predicting the traffic flow in the next 15, 30, or 45 min. This section discusses the results of the models.
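
The conversion of an error indicator into an accuracy value out of 100 can be illustrated with a minimal sketch. The exact conversion is described earlier in the paper and is not reproduced in this section, so the mapping below (accuracy = 100 minus the percentage error, floored at zero) is an assumption, and the function names and flow values are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent; actual values must be non-zero."""
    terms = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(terms) / len(terms)

def accuracy_from_error(error_percent):
    """Map a percentage error (e.g. MAPE or MRE) to an accuracy score out of 100.

    Assumption: accuracy = 100 - error, floored at 0 for errors above 100%.
    """
    return max(0.0, 100.0 - error_percent)

# Hypothetical observed and predicted traffic flows for three 15-min intervals.
observed = [100.0, 200.0, 400.0]
forecast = [90.0, 210.0, 400.0]
score = accuracy_from_error(mape(observed, forecast))  # roughly 95 out of 100
```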

Meta-analysis with All Observations and Accuracy Indicators

Random Effects

In this model, one variable representing the random effect due to the study was introduced, and study-specific intercepts were estimated. The random effect allowed us to account for the unobserved intrinsic heterogeneities among the studies. Table 8 reports the variance of the intercept estimates. The random effect of the studies had considerable variance, and the heterogeneities among the studies contributed substantially to prediction accuracy. Moreover, the variance of the random effect parameter was higher than the residual variance. In addition, a comparison of the marginal R² value, which is associated with the fixed effects only, and the conditional R² value, which is associated with both fixed and random effects, showed that the contribution of the random effects was much higher (0.175 vs. 0.780). The likelihood ratio test for the study-level random component also showed its significant contribution (χ² = 1847.60***).
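
For reference, a random-intercept model and the two R² measures can be written as below. The notation is illustrative, and the variance-partition definitions of marginal and conditional R² are an assumption, since the text does not state which definitions were applied.

```latex
% Random-intercept model for observation i in study j (notation is illustrative)
y_{ij} = \beta_0 + \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + u_j + \varepsilon_{ij},
\qquad u_j \sim \mathcal{N}(0, \sigma_u^2),
\qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2)

% Assumed variance-partition definitions; \sigma_f^2 is the variance of the fixed-effect predictions
R^2_{\text{marginal}} = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2},
\qquad
R^2_{\text{conditional}} = \frac{\sigma_f^2 + \sigma_u^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2}

% Worked arithmetic with the values reported for model A
R^2_{\text{conditional}} - R^2_{\text{marginal}} = 0.780 - 0.175 = 0.605,
\qquad
1 - R^2_{\text{conditional}} = 1 - 0.780 = 0.220
```

Under these assumed definitions, roughly 60% of the total variance would be attributable to the study-level intercepts and about 22% would remain as residual variance, which is consistent with the random effect variance exceeding the residual variance.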

Table 8 Results of the meta-analysis model for all samples

Fixed Effects: Methodologies

For methodologies, seven dummy variables (six for deep learning methods and one for traditional methods) were tested as predictor variables in the linear mixed effects model. All deep learning methods showed a significant positive relationship with prediction accuracy. A comparison of parameter estimates across methods showed that the combined CNN-LSTM had the largest significant positive effect on prediction accuracy, followed by the LSTM and DBN models. In addition, the DNN, CNN, and SAE models also showed significant positive effects on the accuracy levels (see Table 8). Meanwhile, the effect of traditional methods was significantly negative. The findings indicate that deep learning methods produce estimates with better prediction accuracies than traditional methods, and that combined CNN-LSTM models might outperform other models in terms of prediction accuracy. However, it should be noted that not all models can be applied to predict all variables. The combined CNN-LSTM models have been most commonly applied in the prediction of flow and speed, whereas the accuracy levels produced from their application to accident and travel demand prediction showed a high range of distribution (see Fig. 1). The intrinsic properties of a particular model make it suitable for certain applications. For example, traffic flow and speed are known to have strong spatial and temporal dependencies, i.e. the current traffic flow or speed depends largely on the previous and surrounding flow or speed conditions. In such a case, combined CNN and RNN-type models would perform well owing to their ability to handle both spatial and temporal dependencies. Another possible reason behind the high positive significance of the deep learning models might be that researchers used them for the right purpose and area of application.
Sound knowledge of the nature of the data and the choice of the most appropriate model for prediction could be possible reasons for the positive relationships. In addition, it should be noted that the papers reviewed in this study were all deep learning-based papers, often proposing new deep learning models. The results of the meta-analysis might change if papers that incorporated and tested only traditional machine learning methods were also included. Nevertheless, this study provides an empirical basis to evaluate the significance and contribution of deep learning methods towards improving prediction accuracy.
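
The dummy coding of the method variable can be sketched as follows. The six deep learning labels are those named in the text; treating shallow neural networks (SNN) as the reference category is an assumption made for illustration, as the baseline is not stated in this section, and the function name is hypothetical.

```python
# Method labels taken from the text; the reference category is an assumption.
DEEP_METHODS = ["CNN-LSTM", "LSTM", "DBN", "DNN", "CNN", "SAE"]
DUMMY_LEVELS = DEEP_METHODS + ["Traditional"]
REFERENCE = "SNN"  # assumed baseline: shallow neural networks

def encode_method(method):
    """Return the seven 0/1 dummy variables for one observation."""
    if method != REFERENCE and method not in DUMMY_LEVELS:
        raise ValueError(f"unknown method: {method}")
    # The reference category is represented by all dummies being zero.
    return {level: int(method == level) for level in DUMMY_LEVELS}

row = encode_method("CNN-LSTM")   # only the CNN-LSTM dummy is switched on
baseline = encode_method("SNN")   # all seven dummies are zero
```

With this coding, each parameter estimate is read relative to the assumed reference category, which is why a positive estimate for a deep learning method and a negative estimate for traditional methods can appear in the same model.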

Fixed Effects: Area of Application

The effects of six dummy variables representing the areas of application were tested. Predicting traffic speed was positively associated with prediction accuracy. Meanwhile, predicting travel demand, driver behaviour, and accidents had a significant negative impact on accuracy levels. A comparison of the parameter estimates showed that the dummy variable for accident prediction had the strongest impact, followed by the variable for driver behaviour (see Table 8). The other two variables, traffic flow and travel time prediction, showed positive relationships, but the estimates were not statistically significant.

Fixed Effects: Other Variables

Apart from the type of method and the area of application, dummy variables denoting the source of data and the region of study were also used as predictor variables in the model. The dummy variable signifying the use of GPS data showed a negative relationship with accuracy. It should be noted that these results are for all methods and cases. A possible reason might be the nature of errors in the GPS data. Unlike in other data sources such as loop detectors, the errors in GPS data may be more random and more difficult to model. Using a secondary data source, such as weather data and road network attributes, to train parameters showed a significant positive relationship with accuracy levels. Meanwhile, the variables denoting the use of loop detector data and image-based data from cameras were not statistically significant (see Table 8). In addition, studies conducted in an urban area had a significant negative relationship with prediction accuracy. This finding is intuitive: data from urban areas are often complex, and the entanglement of these complex factors of variation makes it more challenging to achieve a higher prediction accuracy.

Meta-analysis with Selected Indicators

Random Effects

Similar to the model with all observations, both model B (observations using the MAPE, MRE, or average accuracy indicators, all converted to denote accuracy out of 100) and model C (observations using the accuracy indicator) showed that the random effect of the studies had considerable variance and that the heterogeneities among the studies contributed substantially to prediction accuracy. In addition, a comparison of the marginal R² value, which is associated with the fixed effects only, and the conditional R² value, which is associated with both fixed and random effects, showed that the contribution of the random effects was much higher in both models (0.212 vs. 0.773 in model B and 0.123 vs. 0.833 in model C). The likelihood ratio tests for the study-level random component also showed its significant contribution (χ² = 1204.00*** for model B and χ² = 166.57*** for model C).
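
A likelihood ratio test of this kind can be sketched as below. The sketch assumes that the models with and without the study-level random intercept differ by one parameter (one degree of freedom), and the log-likelihood values are hypothetical numbers chosen only to reproduce the statistic reported for model B.

```python
import math

def lrt_statistic(loglik_null, loglik_full):
    """Likelihood ratio statistic: twice the gain in log-likelihood."""
    return 2.0 * (loglik_full - loglik_null)

def chi2_sf_df1(x):
    """P(X > x) for a chi-square variable with 1 degree of freedom.

    Uses the identity P(Z**2 > x) = erfc(sqrt(x / 2)) for standard normal Z.
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical log-likelihoods without and with the study-level random intercept.
statistic = lrt_statistic(loglik_null=-5000.0, loglik_full=-4398.0)
p_value = chi2_sf_df1(statistic)
```

Because a variance cannot be negative, the null hypothesis lies on the boundary of the parameter space, so the plain chi-square p-value used here tends to be conservative; with statistics as large as those reported, this makes no practical difference.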

Fixed Effects: Model B

In model B, the findings on the effect of deep learning methodologies were the same as those of model A. Deep learning methods had a significant positive effect on prediction accuracy, whereas traditional machine learning methods had a negative effect. In addition, similar to model A, the dummy variable for the CNN-LSTM method showed the strongest positive effect among all variables for deep learning methods. In the case of variables denoting areas of application, ‘speed’ showed a positive relationship with accuracy, whereas ‘accident’ showed a negative relationship (see Table 8). The effect of the other variables was not statistically significant. In the case of variables denoting data source, only the dummy variable indicating the use of a secondary data source showed a statistically significant, positive relationship with prediction accuracy. The effect of the other dummy variables denoting data source was not statistically significant. Finally, the region of study, i.e. whether the study was conducted in an urban area or not, showed a significant negative relationship with prediction accuracy; this finding too was similar to model A.

Fixed Effects: Model C

Many variables tested in models A and B could not be tested in model C because of the small sample size, as model C only considered classification-based studies that used the accuracy indicator (29 studies). For the effect of methodologies, only the variables for LSTM and traditional methods showed statistically significant relationships (see Table 8). As expected, LSTM showed a positive relationship, whereas traditional methods showed a negative relationship. Among the variables denoting the areas of application, only the effect of driver behaviour was tested, and the result was not statistically significant. In addition, the effects of the variables denoting data source and target area were also not statistically significant.

Meta-analysis for Traffic Flow and Speed Studies

Random Effects

Similar to the other linear mixed effects models (A–C), model D, which used the sub-sample of studies that predicted traffic speed and flow with the MAPE, MRE, and average accuracy indicators, also showed high variance in the study-level random effect. The sub-sample, which contained information on both the sample size and the time horizon of prediction, included a total of 991 observations across 36 studies. The variance of the random effect parameter was again higher than the residual variance (see Table 9). In this model too, R² improved drastically after adding the random effects (from 0.485 to 0.956), which indicates the high contribution of the random effects in explaining the total variance. The likelihood ratio test supported the same conclusion (χ² = 1098.80***).
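
The comparison between study-level variance and residual variance can be illustrated with a simple method-of-moments estimate for a balanced layout. This is not the estimation procedure of the meta-analysis (a fitted linear mixed effects model would normally be used); it is a simplified sketch with hypothetical studies and accuracy values, and it ignores all fixed effects.

```python
from statistics import mean

def variance_components(groups):
    """Method-of-moments variance components for a balanced one-way layout.

    groups maps a study label to a list of accuracy values; every study
    must contribute the same number of observations.
    Returns (study-level variance, residual variance).
    """
    sizes = {len(values) for values in groups.values()}
    if len(sizes) != 1:
        raise ValueError("this sketch only handles balanced designs")
    n = sizes.pop()   # observations per study
    k = len(groups)   # number of studies
    if k < 2 or n < 2:
        raise ValueError("need at least two studies with two observations each")
    group_means = {g: mean(values) for g, values in groups.items()}
    grand_mean = mean(list(group_means.values()))
    # Mean squares between and within studies.
    ms_between = n * sum((m - grand_mean) ** 2 for m in group_means.values()) / (k - 1)
    ms_within = sum(
        (x - group_means[g]) ** 2 for g, values in groups.items() for x in values
    ) / (k * (n - 1))
    var_study = max(0.0, (ms_between - ms_within) / n)
    return var_study, ms_within

# Hypothetical accuracy values (out of 100) reported by three studies.
studies = {
    "study_1": [90.0, 92.0, 94.0],
    "study_2": [70.0, 72.0, 74.0],
    "study_3": [80.0, 82.0, 84.0],
}
var_study, var_resid = variance_components(studies)
share_study = var_study / (var_study + var_resid)  # share of variance between studies
```

In this hypothetical example, accuracy differs far more between studies than within them, which is the pattern described for the four models.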

Table 9 Results of the meta-analysis model for traffic flow and speed studies

Fixed Effects: Sample Size and Time of Prediction

Total sample size, i.e. the number of observations used for training, testing, and validation, was used as an explanatory variable, and its impact on prediction accuracy was estimated. Sample size had a significant positive relationship with prediction accuracy (see Table 9). Although expected, this is an important finding with respect to the future of deep learning and its application in transport studies. It can be interpreted to mean that better prediction of traffic speed or flow requires a large sample size. This in turn requires more (1) financial and (2) computational resources to acquire and handle the big data available from various sources.

The time horizon of prediction, as explained earlier, is how far into the future the variable is estimated, and it had a significant negative relationship with prediction accuracy. This means that as the time horizon of prediction increases, the accuracy of traffic forecasting decreases. Predicting long-term (a few hours to weeks) traffic estimates has always been identified as challenging and prone to errors (Jiang and Adeli 2005). However, prediction accuracies over horizons of 60–120 min also need to be improved. Accurate traffic prediction over longer horizons can be an important aspect of transportation planning and management. Better prediction accuracies across different time horizons would increase the possibility of introducing challenging but efficient policies such as dynamic road pricing by enabling users to make more flexible and sustainable transport choices.

Fixed Effects: Other Variables

The meta-analysis model using the sub-sample of traffic forecasting studies also tested the effects of methodologies and data source types. The findings were similar to those of the earlier models (A and B). Deep learning methods showed significant positive effects on prediction accuracy, with the combined CNN-LSTM model having the strongest relationship with traffic forecasting accuracy (see Table 9). Meanwhile, the relationship of traditional methods with accuracy was significantly negative. The effect of data source type was also tested by creating a dummy variable for loop detectors, which had a significant positive relationship with accuracy.

Conclusions

This review paper conducted a comprehensive survey of the application of deep learning methods in transport studies. Following a detailed search strategy, a total of 136 studies were selected. These studies were reviewed, and information was extracted for several important variables to test their relationship with prediction accuracy. Before this study, three other review studies had concisely summarised the types of methodologies and the areas of application. This study extends these previous works and adds new knowledge in the following ways. First, as the application of deep learning in transportation is growing fast, many new studies that were not covered in the previous review papers were added and reviewed. Second, we analysed the papers dealing with the application of deep learning with respect to various other variables that had not been examined earlier, including the coverage and region of the study, the type of data source, sample size, time horizon of prediction, and the prediction accuracies for different methods and areas of application. Finally, by conducting a meta-analysis, we could empirically establish the relationships between influential factors and prediction accuracy. In this section, we discuss and summarise our key findings with respect to the future of deep learning and AI in transportation. The summary is presented in relation to the most relevant factors analysed in the study. Figure 2 presents a graphical summary of the survey, representing the types of deep learning methods, data sources, and areas of application used across the 136 studies. In addition, it shows the connections between these variables vis-à-vis prediction accuracy.

Fig. 2 Deep learning in transportation: a graphical summary

Prediction accuracy The meta-analysis considered accuracy levels as the dependent variable and tested their relationship with different variables. High prediction accuracy might be one of the most important reasons behind the growing popularity of deep learning methods. Moreover, a high level of prediction accuracy would support the formulation and implementation of successful transportation policies based on forecasts. People’s use of and dependence on information and communication technologies (ICT) have been continuously increasing. The predictive power of our algorithms should be high enough to correctly estimate the interactions between ICT systems, the transportation market, and the higher order impacts on land use, congestion levels, emission levels, etc. AI is likely to be at the centre of all this. However, this analysis clearly established that not all methods are equal, though the accuracy level required for practical use would also depend on the area of application. In addition, high variance due to the heterogeneities intrinsic to a study was observed, and the accuracy depended significantly on many other predictor variables. The challenge for future researchers and policy-makers would be to identify and control these factors.

Methodologies The results of the analysis showed the differences among the methodologies. Combined CNN-LSTM models showed the strongest positive influence, and other deep learning models also showed positive effects. CNN-LSTM models are an improvement over certain deep learning methods, as the combination of both methods enables the extraction of spatiotemporal features and correlations (Yu et al. 2017a, b, c). CNN is utilised to capture the spatial dependencies, whereas LSTM is employed on the time axis to capture the temporal dependencies. Currently, a high percentage of this method’s applications are in traffic speed and flow forecasting, but it has also been applied to other areas: accidents, travel demand, and congestion can also be predicted using the same method. However, the variation in the accuracy levels was high in these other areas of application. Another possible drawback of this method is that RNN-based networks are difficult to train and require high computational power. Studies such as Yu et al. (2017a, b, c) have tried to overcome these issues by employing a fully convolutional structure on the time axis and developing spatiotemporal CNNs. For this analysis, spatiotemporal CNNs were grouped with other CNN models, and their effects were also positive. Ultimately, it is up to policy-makers and researchers to make a trade-off between training time, computational requirement, and prediction accuracy. The decision on which type of model to select would also depend upon its area of application.

Areas of application and data sources Ten distinct areas of application could be identified from the survey. The major focus has been on the prediction of traffic speed and flow, with 1509 out of 2314 cases being from these two areas. A possible reason might be the availability of loop detector and GPS data, which have been mostly used in traffic state prediction (see Fig. 2). However, certain studies also utilised information obtained from car-ITS (Xiangxue et al. 2019) and camera-based sources (He et al. 2019) to predict traffic parameters. The meta-analysis model showed the significance of four area of application variables: traffic speed, travel demand, driver behaviour, and accidents. Given the large amount of data available for traffic forecasting (i.e. speed and flow) and the amount of work conducted in this field (Do et al. 2019; Wang et al. 2019), the positive effect was an expected finding. The current challenge is to improve the performance in areas where the models have shown poor performance, e.g. travel demand, driver behaviour, and accident prediction. Travel demand studies utilised data from GPS, mobile applications, and AFC data sources. Meanwhile, driver behaviour studies utilised four types of data sources: GPS, camera based, AMGB, and ITS based. In addition, accident prediction utilised GPS, camera-based, and secondary accident data to develop deep learning architectures. These areas of application are important aspects of transportation with pertinent policy implications. For example, driver behaviour is an important area of application with respect to the discussions around automated vehicles (AV). AVs can be classified into five levels of automation, and accurately predicting driver behaviour would be very important in cases of partial and conditional automation. There has been immense focus on the image-based classification of vehicles, traffic signs, and pedestrians.
However, improving the accuracy levels for driver behaviour prediction would be an important task for the future.

The other areas of application (driver behaviour and accident prediction) also need further improvement, as their relationships with the accuracy level were negative. Data from ITS such as AFC systems (Tang et al. 2019a, b) have provided an opportunity to predict travel demand in public transport modes. Accurate predictions of demand would be beneficial in future policy-making, such as information provision on crowding inside public transport. Moreover, they might be useful in the application of mobility as a service (MaaS) schemes. However, a major challenge that remains in the field of transportation is the black-box nature of deep learning methods and the difficulty in their interpretation (Wang et al. 2019; Zhang and Zhu 2018). Travel behaviour models for mode choice and activity participation and traffic flow models are grounded in theory, and a major challenge for the future would be to improve the interpretability of the deep learning models. In this regard, for example, a discrete choice model with neural network elements can be developed without compromising behavioural interpretability by adding a certain constraint on the parameters that need to be behaviourally explained (Sifringer et al. 2018).

Sample size and time of prediction No standard way of reporting sample sizes was found in the literature. Studies using loop detector data often reported the number of days for which and the number of loop detectors from which the data were collected (Tian et al. 2018a, b). In addition, they also reported the frequency of aggregation (e.g. 1 min, 5 min, or 15 min). Meanwhile, studies using GPS sources reported trip trajectory information, which was then utilised to extract speed, flow, travel demand data, or other information based on the area of application (Bao et al. 2019a; Elhenawy and Rakha 2017; Zhu and Laptev 2017). In contrast to these sources, image-based sources such as CCTV and other cameras often had a lower sample size but were utilised to extract multiple features from a single image (Eraqi et al. 2019). Given the difference in scales of the data, it might be difficult to compare across studies. In addition, many studies did not report the sample size. However, the meta-analysis showed a significant positive effect of sample size on prediction accuracy. A larger sample size would also require higher computational capabilities, and the challenge for future research would be to find the correct balance between the size of data and the required prediction accuracy. Finally, most of the studies analysed produced short-term traffic forecasts, but the time period used for short-term prediction also varies. The meta-analysis established that as the time horizon of prediction increases, the prediction accuracy tends to decrease. Accurate predictions made 30 min to 2 h in advance can be beneficial both for policy-makers and for individuals planning their personal trips. Dynamic congestion pricing, MaaS, and traffic state prediction during disruptions and big events are possible areas of application that might benefit from improving this aspect.

Along with the key findings, contributions, and discussions for future research, it is important to discuss the limitations of this study. This study did not consider the work done in image-based classification of vehicles, traffic signs, pedestrians, and pavement crack detection. These areas of application have been comprehensively covered in Wang et al. (2019). Rather, it focused on travel-related and travel behaviour factors. Moreover, studies that used deep learning in traffic signal control were not included, as they did not usually involve the prediction of any factor. These are important areas of application, and not considering them remains a major limitation of this work. Moreover, the papers considered for this research are limited by the search strategy, and it is possible that relevant work conducted in this field was not included. In addition, the meta-analysis combines the work done in both discrete state prediction (driver behaviour, activity state, etc.) and continuous value prediction (traffic flow and speed), while combining the different accuracy indicators across these studies. The combination across different scales was primarily done for ease of analysis and comparison. Finally, hyper-parameters related to deep learning models, such as the number of hidden layers and epochs, were not considered as part of this analysis.