1 Introduction

Rapid urbanization and the associated increase in population are resulting in higher growth of motorized traffic in urban areas. As a consequence, cities are experiencing problems such as air pollution, road accidents, and congestion. In response to these problems, public transportation (PT) could help to reduce air pollution, road congestion, travel time, and dependency on non-renewable energy, which benefits both riders and non-riders71. Understanding the travel patterns and travel behavior of PT users, forecasting their demand, and improving the performance of PT services are central to PT planning. Traditionally, public transport research relied on manual travel surveys conducted onboard, at stops/stations, or at the household level14, 17. These methods are useful in describing socio-demographic characteristics (age, gender, income, occupation, etc.) of the respondents along with their detailed travel information (trip purpose, mode choice, etc.). However, these methods are labor intensive, leading to higher costs and consequently smaller sample sizes and lower update frequencies. This is particularly problematic in the context of developing countries in the Global South, where resources are more limited. Further, the data can contain reporting errors and are prone to statistical biases17, 27. Most importantly, data collected through manual surveys are typically non-panel in nature and unable to capture short-term and long-term variations in PT usage. Owing to their coarse spatial resolution and non-dynamic nature, it is typically not possible to combine such data with land use, weather, and dynamic network conditions. This makes them difficult to use for PT planning and operational decisions.

On the other hand, the rapid advancement of information and communication technology (ICT) has brought a revolution in the domain of transport research over the last decade. The resulting data sources range from PT-specific sources like smart cards used for automatic fare collection and global positioning system (GPS) traces for automatic vehicle location (AVL) to more generic data like digital footprints of mobile phone users, geo-coded social media records, etc. The ubiquity of these data sources has led to the passive generation of an unprecedented amount of data, which are precisely geo-referenced and spatiotemporal in nature and suitable for explaining human mobility patterns at much lower cost2, 57, 58. Other heterogeneous data sources include loop detectors (which collect traffic data), probe vehicles (which measure traffic conditions), Bluetooth (which captures travel times or average speeds and the associated variability), video cameras, remote sensing, and street imagery such as Google Street View (GSV) and Bing StreetSide27, 55. Given that most of these data sources are large in volume, they fall under the general umbrella of big data.

In comparison with the traditional data sources, these novel sources of data show many unique attributes and advantages17, 26, 69. Firstly, big data sources contain up-to-date, near-real-time or real-time spatial and temporal information that is practically impossible to collect through traditional travel surveys (e.g., face-to-face interviews, telephone interviews, travel diaries, and web form surveys). Secondly, they contain a large amount of individual-level data with greater detail and higher accuracy at lower cost. Some of these data can potentially be linked with supplementary data (e.g., land use, bus timetables, etc.) as well as with each other (different types of data for the same person), though data linking carries the risk of breaching privacy. Thirdly, these can be used to reconstruct large-scale trajectory data for a larger sample size and longer observation period. The availability of such large datasets opens up possibilities for more dynamic research in the field of transportation planning. Despite these opportunities, there are also challenges associated with the collection, processing, and analysis of big data, which need to be addressed when employing these data in transport planning. Further, mining big data for meaningful applications requires different methods and techniques for data processing and transport modeling, which raise new technical challenges in ensuring computational efficiency, data processing, integration, evaluation, validation, and user privacy.
Major challenges include: (1) the presence of data gaps (e.g., discontinuity in the location data, errors, and missing information); (2) some details (e.g., trip purposes, accompanying travelers) are not explicitly recorded; (3) in some forms of big data, termed extrinsic mobility data, the location information is generated by non-transport activity (e.g., during a phone call or text message) and hence cannot be converted directly to mobility data for transport studies without significant processing; and (4) the absence of personal or socio-demographic information about the user, which is a key input to some of the traditional models (econometric models, for example).

In the last two decades, many articles have reviewed the application of different big data sources in transportation planning. Among the current review-based studies, Anda et al.5 reviewed the current application of big data sources to understand travel behavior and develop travel demand models for transport planning. They recommended using big data for ‘activity-based models’ and ‘agent-based simulations’ to understand individual travel behavior. They further described recent advancements in data-mining methods applied for trip identification, activity inference, and mode inference at the individual level. Yue et al.74 reviewed current studies related to travel behavior that used trajectory-based data and explored scattered technologies, tools, and data sources (traditional travel survey data, GPS log data, smart card data, mobile phone data, and other non-conventional sources such as social media and banknote data). They also documented major challenges of using trajectory data in behavioral studies, such as data privacy, data accessibility for different stakeholders, socio-demographic bias in the available data, and the absence of new data modeling techniques to reduce computational cost. Additionally, they highlighted new opportunities offered by complex network science, computational social science, and bottom-up approaches in travel behavior studies. Further, Wang et al.69 provided a review of travel behavior research based on mobile phone data obtained from different sources (cellular network and smartphone sensor-based data). They highlighted that the majority of these studies focused on the re-identification of human travel patterns (such as travel frequency and distance), but failed to reveal the inherent mechanism responsible for the observed patterns. They also elaborated on the major potentials and challenges of using mobile phone data in travel behavior studies.
Chen et al.17 reviewed the current methodologies of using mobile phone data for travel behavior analysis in three sub-areas: modeling travel behavior, behavioral factors, and human mobility patterns. They identified an emerging need for cross-disciplinary research to foster conversation and collaboration among different disciplines. Huang et al.34 reviewed the advantages and disadvantages of different methods to detect transport mode based on mobile phone network data. They emphasized that most of the studies focused on easy-to-detect modes due to the lack of ground truth data. They also highlighted that most of the studies did not validate their results, or simply validated their proposed methods with aggregated data. However, the above-mentioned reviews are generic, whereas the research needs, available big data sources, analyses, and analysis methodologies are distinctly different for PT.

In the context of PT, Pelletier et al.57 focused on the application of smart card payment data in PT, showing that in addition to fare collection these data can be used for many purposes such as strategic-level studies (long-term planning), tactical-level studies (service adjustment, transfer journey studies, etc.), and operational-level studies (calculation of PT performance indicators, payment management, etc.). This study, however, focuses only on smart card data and is slightly dated. In particular, over the last decade, many other types of big data have emerged that can be used for PT planning. Therefore, further study is needed to investigate what the new big data sources have to offer and what novel methods can be deployed to best utilize them for informed decision-making to guide the improved performance of the PT modes.

The main purpose of this systematic review-based study is to explore the recent research in PT planning using big data and assess the usability and potential of the novel data sources (in comparison to conventional data such as household travel diary surveys and population census). Therefore, existing relevant literature is reviewed according to different aspects of PT planning such as analyses of travel patterns and understanding travel behavior. Our review differs from existing review-based articles in the following ways. Firstly, our focus is on research that aims to validate the application of big data sources in PT planning. Secondly, the aim is to give the reader an overview of the application of big data in the domain of PT planning. The article is organized in the following order. Section 2 summarizes the methodological approach followed in this study. Section 3 describes the available novel data sources used directly or indirectly in PT research. Section 4 critically reviews the selected papers. Section 5 presents the conclusions and future research paths.

It may be noted that most of the articles related to big data use in PT studies were from North America and Asia. About 11% of the reviewed articles were from the North American context and 18% from the Asian context. The USA, Canada, UK, China, South Korea, and Australia were among the leading countries using big data extensively for PT planning. From the Global South, very few articles were from Chile and Brazil. Only one article was found in the African context.

2 Methodology of Review

This study adopted the three-stage systematic literature review approach proposed by Bask and Rajahonka11. Stage 1, the “Planning stage”, sets out the objectives and review protocol and defines the sources and procedures for the article search. At this stage, we identified our research question based on our research aim and objective: What is the current state-of-the-art application of big data in PT planning? We further selected the inclusion and exclusion criteria for the final review. To ensure a comprehensive search, we used five databases: Scopus, Science Direct, Wiley Online Library, Taylor & Francis, and Google Scholar. The following keywords were used to search articles: “Big Data” or “Smart Card” or “Mobile Phone” or “Social Media” or “Passive Data” AND “Public Transport” or “Transportation” AND “Planning”. The same keywords were used in all five databases to avoid any bias in the search process. A total of 272 articles were found after this stage. The inclusion criteria were determined to fulfill the research aim.
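As an illustration of how the keyword combination above can be applied identically across databases, the following minimal sketch assembles a single boolean search string. The grouping (terms joined with OR inside a group, groups joined with AND) is an assumption about how the keywords were combined, and `build_query` is a hypothetical helper, not part of any database's interface.

```python
# Keyword groups as listed in the review protocol. ASSUMPTION: terms inside a
# group are alternatives (OR) and the three groups must all match (AND).
KEYWORD_GROUPS = [
    ["Big Data", "Smart Card", "Mobile Phone", "Social Media", "Passive Data"],
    ["Public Transport", "Transportation"],
    ["Planning"],
]


def build_query(groups):
    """Return a boolean search string of the form (a OR b) AND (c OR d) AND e."""
    clauses = []
    for terms in groups:
        quoted = [f'"{term}"' for term in terms]
        clause = " OR ".join(quoted)
        if len(quoted) > 1:
            clause = f"({clause})"  # parenthesise multi-term groups
        clauses.append(clause)
    return " AND ".join(clauses)


QUERY = build_query(KEYWORD_GROUPS)
print(QUERY)
```

Building the string once and reusing it is one simple way to keep the search terms constant across sources; individual databases may still require their own field tags or syntax.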

Stage 2, “Screening”, includes descriptive and structural analysis. The titles and abstracts of the 272 articles were screened for preliminary selection. The screening was primarily based on the following criteria: accepted/published academic journal articles, full text available, and published in English. We found a total of 102 articles at this stage. When the abstract of an article indicated a research aim similar to that of our study, the full text was scanned to include it in the review pool. In addition, we excluded articles appearing redundantly in different search engines, editorials, book chapters, conference proceedings, and articles not in English. Finally, 47 articles were selected for critical review. The selected 47 articles were further categorized based on qualitative pattern matching (similarities/dissimilarities) and multiple sources of evidence analysis (validation and acceptance of hypotheses)73. Thus, the aims and hypotheses considered in the selected articles were tabulated and the articles were categorized under three themes according to their respective research purposes. Other supporting articles were also included during the review of the 47 articles to elaborate on and support the overall findings of the reviewed papers.
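The screening criteria above can be expressed as a simple filter over bibliographic records. The sketch below is illustrative only: the `Record` fields, the title-based duplicate check, and the sample values are assumptions, and it does not attempt to reproduce the article counts reported in this review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical bibliographic record; field names are illustrative."""
    title: str
    doc_type: str   # e.g. "journal", "editorial", "book chapter", "conference"
    language: str
    full_text: bool
    source: str     # database that returned the record


def normalise(title):
    """Lower-case and collapse whitespace so duplicate titles compare equal."""
    return " ".join(title.lower().split())


def screen(records):
    """Drop duplicates across databases, then apply the inclusion criteria."""
    seen = set()
    kept = []
    for record in records:
        key = normalise(record.title)
        if key in seen:
            continue  # same article already returned by another database
        seen.add(key)
        if (record.doc_type == "journal"
                and record.language == "English"
                and record.full_text):
            kept.append(record)
    return kept
```

Matching duplicates on a normalised title is a deliberately crude choice; matching on DOI would be more robust where identifiers are available.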

In Stage 3, “Reporting and Dissemination”, we organized our findings to write the review paper. In the writing stage, we followed the categories developed in the “Review Stage” to summarize our findings. Finally, we explained the significance of the reviewed papers and their contributions to the PT research domain, along with the research gaps and opportunities for future research.

3 Types of Big Data in PT Planning

Among the different definitions of big data, we refer to the simplest one: “any data that cannot fit into an Excel spreadsheet”12. These encompass automatically and routinely generated diverse dynamic information coming from different sources (e.g., sensors, devices, third parties, web applications, and social media) at various speeds and frequencies12, 17, 59. These data sources are thus also aligned with the definitions proposed by Laney41 and 31, where big data is defined with 3 Vs (volume, velocity, variety) and 5 V + C (3 V + variability, veracity, complexity), respectively.

Over the last two decades, a plethora of studies have applied multiple passively collected data sources to transport planning research. However, the discussion regarding the application of big data for PT planning is fragmented and distributed across different outlets of the PT planning domain. Systematic and comprehensive reviews on the application of big data in PT planning are unavailable. Hence, in this study we focused only on the disparate applications of the novel data sources for PT planning. The systematic review of literature revealed three key categories of data for PT planning.

  1. (a)

    Smart card data: The key purpose of smart card data is to ensure smooth automatic fare collection in PT. Currently, many large- and medium-sized cities in the world, such as London (Oyster card), New York (SmartLink), Boston (Charlie card), Beijing (Yikatong), and Hong Kong (Octopus card), have their own smart card system. These cards are based on radio-frequency identification (RFID) technology, and passengers are required to tap the cards during entry and/or exit. Depending on the type of smart card, it automatically and continuously collects different trip records while PT is being used. For example, an entry-only smart card records information (boarding time, location of stop, transport mode, and/or stop number) when passengers enter a transit station or board a vehicle10, 25. Hence, no information on the destination stop is recorded. On the other hand, multimodal smart cards such as the London Oyster card also collect information related to both entry and exit points and transport mode28. Therefore, information collected from the smart card can be used for PT planning and not merely for fare collection57.

  2. (b)

    Mobile phone data: At present, most individuals carry a mobile phone almost everywhere, which makes mobile phone data the largest human mobility data source8. There are broadly two sources of mobile phone data: cellular network-based data and smartphone sensor-based data69. Cellular network-based data are collected by telecommunication companies. Two types of network-based data have been used in contemporary PT studies: call data record (CDR) and global system for mobile communication (GSM) data1, 43. CDR data comprise a set of phone activity (phone call, text message, or Internet access) records along with the time and the location of the cell towers channeling the activity. GSM data are generated from the interaction between a device and the mobile network as long as the device is turned on69. For a single mobile phone, CDR or GSM data are sparse and provide very little information. However, aggregating the records of thousands of mobile phones overcomes this limitation1. Of the two types, GSM data have a higher frequency than CDR data, but are typically more difficult to access. CDR data, on the other hand, are routinely saved by mobile phone companies for billing purposes and involve no additional effort in data provision.

On the other hand, smartphone sensor-based data can be collected by dedicated applications69. For example, check-in information from social media (Twitter, Facebook, etc.) or popular sports tracking apps provides higher spatial resolution data compared to network-based data. However, this form of data is associated with serious sampling bias and poor temporal granularity; therefore, very few studies have used it for PT research58. Nevertheless, the advancement of Internet and smartphone technology provides a unique opportunity for transport planners/modelers to estimate demand fluctuations during special events (e.g., Olympic Games, Formula 1).

  3. (c)

    GPS data and automatic vehicle location (AVL): The GPS technology installed in the vehicle enables the collection of the time, location, and service status of transport modes66. An AVL database is prepared by collecting various geo-coded information about the mode (such as latitude, longitude, time, and date) using on-vehicle GPS at a constant or varying time interval. AVL data are widely used in conjunction with smart card data in different outlets of PT planning research30, 48, 49, 54, 80, and extend the application of geographic information systems (GIS) to perform spatial and temporal analysis in an urban landscape. Even though GPS data have many other applications in private transport, such as car route mapping, measuring taxi service, freight tracking, and commercial fleet management, in this study we only considered GPS data that are collected in the PT modes and used to develop the AVL database.
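To make concrete the point under (b) that network-based mobile phone records become informative only once aggregated over many users, the following minimal sketch counts tower-to-tower movements across all users. The record layout `(user_id, timestamp, tower_id)`, the one-hour gap threshold, and the function name `tower_flows` are all assumptions for illustration; real CDR processing involves considerably more cleaning.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta


def tower_flows(records, max_gap=timedelta(hours=1)):
    """Count tower-to-tower movements aggregated over all users.

    records: iterable of (user_id, timestamp, tower_id) tuples (assumed layout).
    Two consecutive records of one user count as a movement when the towers
    differ and the records are at most `max_gap` apart.
    """
    by_user = defaultdict(list)
    for user, timestamp, tower in records:
        by_user[user].append((timestamp, tower))

    flows = Counter()
    for events in by_user.values():
        events.sort()  # chronological order per user
        for (t0, a), (t1, b) in zip(events, events[1:]):
            if a != b and t1 - t0 <= max_gap:
                flows[(a, b)] += 1
    return flows


# Toy example with made-up records.
sample = [
    ("u1", datetime(2020, 1, 6, 8, 0), "A"),
    ("u1", datetime(2020, 1, 6, 8, 30), "B"),
    ("u2", datetime(2020, 1, 6, 8, 5), "A"),
    ("u2", datetime(2020, 1, 6, 8, 50), "B"),
    ("u3", datetime(2020, 1, 6, 8, 0), "A"),
    ("u3", datetime(2020, 1, 6, 12, 0), "B"),  # gap too long, ignored
]
print(tower_flows(sample))
```

A single user's two or three records say little, but the aggregated counts already resemble a coarse flow table between tower catchments.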

Other heterogeneous data sources such as loop detectors, Bluetooth, and video camera can be considered as big data by definition and used in the transport research domain; however, these were not included in this study, as their application was non-specific in the reviewed articles on PT research domain.

4 Current State-of-the-Art of PT Studies Using Big Data

4.1 Theme 1: Use of Big Data in Travel Pattern Analysis

In the broad spectrum of urban planning and transport studies, it is important to understand and model how individuals move in time and space, called travel behavior analysis, which is important to understand the travel demand14, 18. In the context of PT, travel pattern and related statistical analyses are important while adopting plans to improve current and promote new PT services. In this section, we explore whether “big data” can substitute the conventional data collected through field survey for travel behavior analysis in PT planning.

More than one-third (36%) of the articles reviewed in this study focused on travel behavior analyses of PT users using big data. The major research topics under this theme are the identification of aggregate/individual and single-day/multiday travel behavior, inference of trip purpose, socio-demographic status of transit users, analysis of activity patterns, and spatial and temporal variability in transit use (Table 1). As seen in Table 1, the authors mainly used smart card data for travel behavior analysis of transit users. Along with smart card data, AVL data were also incorporated in travel behavior analysis.

Table 1: Review of studies on big data for travel pattern analysis in PT planning.

4.1.1 Aggregate vs Individual Travel Behavior

Day-to-day travel behavior analysis is difficult using conventional data, due to the cost and complexities associated with the data collection method. Among the different novel data sources, smart card data have the capability to enhance existing decision-making tools, minimizing the complexities associated with conventional data collection systems (e.g., onboard surveys)42. Several studies explored the potential of these data for aggregated and individual travel behavior analysis37. Zhang et al.75 proposed a method to identify group travel behavior (GTB) with PT smart card data based on proxemics theory. Although aggregated behavior analysis draws the trip pattern of the general user, it fails to capture the individuality in travel behavior36. Hence, Kieu et al.36 proposed a new method entitled Weighted Stop Density-Based Scanning Algorithm with Noise (WS-DBSCAN) using smart card information to detect the spatial variability of individual travel patterns. Smart card data were further used for single-day to multiday travel behavior analysis. Zhao et al.77 proposed a regularized logistic regression model to predict daily individual travel patterns using the historic sequence of individual trip records collected from smart cards and reached 20–30% accuracy in predicting time, origin, and destination in combination. In addition, studies proposed data-mining methods and probabilistic models to analyze the multiday travel behavior and regularities of individuals and subgroups14, 21.
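The density-based idea behind detecting an individual's habitual stop areas can be illustrated with plain DBSCAN. The sketch below is not the WS-DBSCAN algorithm of Kieu et al.; it is the standard, unweighted algorithm applied to hypothetical stop coordinates given in metres on a projected plane, with `eps` and `min_pts` chosen arbitrarily.

```python
import math


def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one cluster label per point (-1 means noise)."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


# Hypothetical boarding-stop coordinates (metres) for one card holder.
stops = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(stops, eps=1.5, min_pts=3))
```

In this toy input the two tight groups of stops form two clusters and the isolated stop is labelled as noise, which mirrors the distinction between regularly used stop areas and occasional ones.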

Since the proposed methodologies aided in predicting individual and aggregated trip patterns of PT users using smart card and AVL data, the outcome of the research would be useful to propose short-term and long-term policies and strategies to support predictable users. For example, the proposed trip prediction models can be used to develop traveler information system which will inform the user about service shortage and delays in certain areas77. As smart card and AVL data are collected routinely and passively, such behavioral prediction can be done within a short/long time interval which will provide the opportunities to evaluate the impact of policy changes.

4.1.2 Inference of Travel Behavioral Attributes

It is considered that additional attributes such as socio-demographic and activity information improve the understanding of travel behavior21. Smart card and other passive data sources are criticized for the absence of socio-demographic information, which is collected in survey-based methods. Hence, efforts have been made to integrate socio-demographic information with big data to portray a comprehensive picture of travel behavior. Several studies proposed data-mining and probabilistic models that use smart card data to infer trip purposes3, 23, 40, 42. Goulet-Langlois et al.29 followed agglomerative hierarchical clustering techniques to understand passenger heterogeneity from a longitudinal representation of each user’s multiweek activity using smart card data. Further, Amaya et al.4 attempted to identify the residence zone of smart card users to include this socioeconomic variable in travel pattern analysis.

Moreover, various studies assessed spatio-temporal variability of PT use using clustering techniques and probabilistic model integrating smart card multiday data record38, 47, 49, 51, 63. Wang et al.68 proposed a location choice model of metro commuters for predicting after-work activity location using smart card data. The method proposed by Ma et al.47 achieved 94.1% overall accuracy in identifying commuter trips. Further, Long et al.46 classified different extreme PT riders using both traditional household data and smart card information. Besides, big data was used to determine the influence of external factors such as weather conditions on transit ridership using regression analysis7. Therefore, by synthesizing these novel data sources, more meaningful insights about travel behavior can be inferred.

Though manual surveys contain socio-demographic attributes and activity information, this information is often associated with sampling bias. Therefore, conventional models developed using manual survey-based data can be misleading when interpreting travel patterns. On the other hand, the proposed models (based on classification, clustering, and prediction techniques) developed using big data are more dynamic in nature, are capable of incorporating multiple data sources from fine to coarse resolution to predict socio-demographic attributes, and reduce the need for conventional surveys as a source of socio-demographic information.

4.2 Theme 2: Use of Big Data in PT Modeling

Since the 1960s, one of the most prominent and widely acknowledged transport modeling approaches has been four-stage modeling44. It is widely used as a systematic framework for both public and private transport modeling, which follows the sequence of (a) trip generation, (b) trip distribution, (c) modal split, and (d) trip assignment. This modeling approach requires a large amount of spatio-temporal individual trip information and demographic data, the collection of which is a manpower-, time-, and investment-intensive process5. After the emergence of big data sources, efforts have been made to use big data as an alternative to conventional survey data in transport modeling. But the question remains as to whether it is possible to replace the costly conventional survey data with big data in PT modeling. Fourteen (30%) of the articles reviewed in this study addressed this question.

Among the different sources of big data, the use of smart card information in PT modeling is observed in all 14 relevant articles reviewed in this study (Table 2). After the emergence of “smart card” for PT fare payment, it is considered as a useful data source for the planner and researcher, since it produces a large amount of boarding/alighting information depending on the type of card57.

Table 2: Review of studies on big data for PT modeling.

4.2.1 Estimation of Origin and Destination (O–D)

At the beginning of the last decade, this form of data was employed to elicit basic information required for transport modeling, such as origin and destination (O–D), to quantify transport demand between geographical regions in a city. Various algorithms were proposed to determine station-to-station O–D trip tables by using smart card data for unimodal PT trips (subway/bus)10, 25, 65, 67. Subsequently, various studies proposed methods suitable for large multimodal PT systems to estimate the O–D matrix from smart card data9, 28, 54, 76. For instance, along with rail-to-rail trips, Zhao et al.76 considered rail-to-bus trips to develop the O–D matrix. Further, Barry et al.9 incorporated multiple modes (subway, local and express buses, ferry, and tramway) for O–D estimation. In these studies, other data sources such as AVL data and geo-coded GPS data were fused with smart card information. Using the algorithms proposed by Barry et al.9, O–D trip tables were created for short-term and long-term demand estimation. Besides disaggregate O–D estimation, a study conducted by Tamblay et al.62 also attempted to infer a zonal O–D matrix from smart card information to reflect the city-wide travel demand. Since PT journeys often comprise multiple transfers, smart card data were also used in contemporary studies to understand linked trips and derive more complete information on PT trips. Methods were proposed to infer passenger journeys and analyze transfer patterns20, 28, 54, 60. Through the analysis of the transfer patterns of linked trips, the multiple modes used to complete a trip were also detected to complete the O–D estimation.

To validate the various methods for the use of smart card data and other passive data sources in O–D estimation, multiple validation techniques have been proposed in several studies. As such, travel diary, manual survey data, and historical observations were used to validate the methods proposed to estimate the O–D matrix 10, 28. These methods can predict O–D with an accuracy level ranging from 66% to 90%. In addition, Munizaga et al. 53 proposed an endogenous validation method (analyze the data to verify assumption) to validate the assumptions considered for PT origin–destination (O–D) matrices using smart card data and survey data. On the other hand, Kumar et al.39 proposed a new trip-chaining algorithm for O–D inference that tries to relax the assumptions on the parameters such as GPS inaccuracy (buffer zone for boarding stop inference).
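The trip-chaining idea that underlies many of these O–D estimation methods can be sketched as follows for entry-only smart card data. This is a deliberately simplified illustration and not the algorithm of any cited study: it assumes that the destination of each trip is the next boarding stop of the same card, that the last trip of the day returns to the first boarding stop, and it omits the walking-distance buffer and route geometry that published methods use. All names and the record layout are hypothetical.

```python
from collections import defaultdict


def chain_trips(boardings):
    """Infer alighting stops from one day of entry-only boardings.

    boardings: iterable of (card_id, boarding_time, boarding_stop).
    Returns a list of (card_id, origin_stop, inferred_destination_stop).
    """
    by_card = defaultdict(list)
    for card, time, stop in boardings:
        by_card[card].append((time, stop))

    trips = []
    for card, taps in by_card.items():
        if len(taps) < 2:
            continue  # a single tap gives no evidence about the destination
        taps.sort()
        stops = [stop for _, stop in taps]
        # Destination of trip k = origin of trip k+1; the last trip is
        # assumed to end at the first origin of the day.
        next_stops = stops[1:] + stops[:1]
        for origin, destination in zip(stops, next_stops):
            if origin != destination:
                trips.append((card, origin, destination))
    return trips


def od_matrix(trips):
    """Aggregate inferred trips into stop-to-stop O-D counts."""
    counts = defaultdict(int)
    for _, origin, destination in trips:
        counts[(origin, destination)] += 1
    return dict(counts)


# Toy example: times are minutes after midnight, stop names are made up.
taps = [("A", 480, "S1"), ("A", 1020, "S2"), ("B", 500, "S3")]
print(od_matrix(chain_trips(taps)))
```

Cards with a single boarding are dropped here, which is one reason validation against surveys or exit records, as discussed above, matters for such inferred matrices.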

4.2.2 Route Choice Modeling

Three articles dealt with the route choice of travelers in PT networks by considering multiple attributes. The interchanges between the segments of a linked journey can be recognized using smart card data fused with other data sources35. Nassir et al.56 proposed a recursive link-based path choice model using smart card data in higher-frequency bus and rail services and added a new measure of “attractiveness” to allow for randomness in the choice of attractive routes. Cheon et al.19 proposed a trip assignment model that accounts for the coexistence of various modes in a single network and uses multiple attributes to effectively reduce unreasonable paths. In addition, Kim et al.37 introduced a new attribute termed ‘stickiness’ to understand individuals’ habitual route choice through cross-sectional and longitudinal analysis using the whole trajectories of individual passengers, constructed from smart card data.

Therefore, the application of big data in PT modeling is observed in three (trip generation, distribution, and trip assignment) of the four stages of transport modeling. Future research is needed to address modal split using big data sources. Thus far, studies have attempted to establish passively collected data as an alternative to conventional survey data for PT modeling. By using different data sources (smart card, AVL, travel survey data, etc.), the proposed models exhibit promising results in developing aggregate and disaggregate O–D matrices. After precise identification or prediction of the origin and destination of different trips, the outcomes of the models (spatial and temporal information of trips) would be useful for further analysis to understand the reasons behind trip-making behavior, seasonal and daily travel demand, and macro- to micro-level interaction among the factors governing travel patterns. Also, the proposed methods overcome the challenge of capturing the complex travel patterns in modern PT systems (involving multiple transfers and modes), which may be practically impossible to infer precisely from traditional survey data (e.g., cordon line, screen line, or household survey data).

4.3 Theme 3: Use of Big Data in PT Performance Assessment

In recent years, big data has been applied in PT performance assessment. By reviewing 16 selected articles under this theme, we attempted to answer the question of whether big data can substitute conventional data in assessing the performance of PT service. Unlike the previous two themes, where the hegemony of smart card data was observed, here the use of other passive data sources (e.g., mobile phone, social media) alongside smart card data was evident.

Performance assessment has become an integral part of transport planning and management, which helps to ensure accountability, transparency, and service quality16. The resultant information from performance assessment is necessary for decision makers to evaluate investment alternatives in the transport sector61. Therefore, the growing range of uses of performance assessment expedites the development of performance measures, which are an integral part of performance assessment13. The review of studies on big data in PT performance improvement is summarized in Table 3.

Table 3: Review of studies on big data in PT performance improvement.

4.3.1 Measurement of Performance Assessment Indicators

To use big data in performance assessment of PT services, attempts have been made to measure different performance indicators of PT. The majority of the existing contributions focused on developing methodologies for PT performance assessment. In the reviewed articles, big data was used to estimate regular performance measures such as quality of PT service using GSM data1, physical and schedule-based connections of metro user using quadruple33, bus arrival time using smart card data79, left behind passenger using smart card data and AVL data80, accessibility to PT service using mobile phone data43, passenger waiting time using smart card data64, and spatial variations of urban PT ridership using GPS trajectories and smart card data66. Further, Min et al.50 proposed a method to recover the arrival times of trains from the exit times of metro passengers.
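The intuition behind recovering train arrival times from passenger exit times can be illustrated with a simple burst-detection sketch: passengers from one train tend to reach the exit gates in a short burst, so the start of each burst approximates an arrival. This is not the method of Min et al.; the gap threshold `max_gap`, the optional platform-to-gate `walk_time`, and the function names are assumptions for illustration.

```python
def exit_bursts(exit_times, max_gap=60):
    """Group exit-gate tap times (seconds) into bursts.

    A new burst starts whenever the gap to the previous tap exceeds max_gap.
    """
    bursts = []
    for t in sorted(exit_times):
        if bursts and t - bursts[-1][-1] <= max_gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


def estimate_arrivals(exit_times, max_gap=60, walk_time=0):
    """Approximate each train arrival by the first tap of a burst,
    shifted back by an assumed platform-to-gate walking time."""
    return [burst[0] - walk_time for burst in exit_bursts(exit_times, max_gap)]


# Toy example: tap times in seconds after an arbitrary reference.
taps_at_station = [100, 110, 125, 400, 405, 900]
print(estimate_arrivals(taps_at_station))
print(estimate_arrivals(taps_at_station, walk_time=30))
```

At busy stations bursts from consecutive trains can overlap, so a fixed gap threshold is a strong simplification that would need calibration against timetable or AVL data.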

4.3.2 Evaluation of Performance

Big data has also been used to evaluate the performance of PT service, for example the importance of access and egress times to/from high-speed railway (HSR) stations52, the impact of fare policy change on PT ridership45, and the relationship between transit fare, space, and justice78. An attempt has also been made to create an online data-driven platform for performance measurement in Beijing, China48. Liu et al.44 proposed a method to replicate the multimodal PT system using smart card data; the resulting replication covers about 96% of PT trips in Singapore. Using cell phone data, a comprehensive dataset was also built for para-transit service for performance improvement in Nairobi, Kenya70. Big data has further been used to monitor special events and circumstances. Pereira et al.58 developed a method to predict PT arrivals at the time of special events using Twitter data. In addition, to improve the accuracy of predicting the impact of planned, temporary disturbances (such as temporary track closures) on PT usage, Yap et al.72 proposed a method using smart card data.

To initiate PT improvement and management programs, the prerequisite is to measure the performance of the existing PT system in order to identify problems, their root causes, and sectors requiring special attention24. Even though performance assessment of PT service has been acknowledged as an effective planning, management, monitoring, and evaluation tool, it is less prioritized and practiced in developing countries because the data it requires are scarce, even after a project has been implemented. The lack of tangible outcomes from performance assessment could lead to political neglect compared with investment-intensive infrastructure development programs. The application of big data for measuring performance indicators (e.g., service quality, accessibility to PT) and for evaluating performance (under special circumstances) has overcome the budget constraint associated with conventional data collection methods. Further evolution of such applications could enable decision makers to use big data in public transport policy making6.

5 Conclusion and Future Research Direction

This systematic review-based study critically analyzed the current advances in the application of big data in PT planning. Following a three-stage review process, we categorized 47 reviewed papers under three subsections: travel pattern analysis, PT modeling, and PT performance assessment. We found that the high potential and usefulness of big data (particularly smart card data, mobile phone data, and AVL data) in PT planning is widely acknowledged. The general finding is that the emerging big data sources provide models and tools for PT planning that are at least as good as, if not better than, those based on conventional data. Further, the majority of the studies claim that, due to the ‘by-product’ nature of such data, these tools are cheaper to develop than those involving the collection of traditional data.

However, the majority of the reviewed studies have focused on investigating conventional PT planning topics (e.g., O–D estimation, route choice modeling) with passively collected data sources. There is a research gap in extending these to discover more novel applications of big data for PT planning. One particularly promising direction in this regard is the potential to develop more dynamic planning models that better utilize the panel nature of the data in modeling the variability in behavior and the interaction of different influencing variables over time. For example, the trip-chaining data available from smart cards can be used to optimize overall transfer times; increases or decreases in boarding numbers at a certain bus stop can be used to make minor adjustments to the dwell time at subsequent stops to prevent systematic ‘bus bunching’; and an established relationship between weather and O–D patterns can be used to improve seasonal changes in the timetable.
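As a rough illustration of the first example, the following Python sketch derives transfer times from trip-chained smart card records. It is only a sketch under stated assumptions: the record layout (card ID, route, boarding minute, alighting minute), the sample values, and the 30-minute threshold that separates a transfer from a new activity are all hypothetical, and it assumes a system that records both tap-in and tap-out.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical tap records: (card_id, route, board_min, alight_min),
# with times in minutes after midnight. Layout and values are assumptions.
TAPS = [
    ("A", "bus_1", 480, 495),
    ("A", "metro_2", 503, 520),
    ("A", "bus_1", 1020, 1040),
    ("B", "bus_1", 482, 495),
    ("B", "metro_2", 507, 530),
    ("C", "metro_2", 600, 615),
]

MAX_TRANSFER_MIN = 30  # assumed cut-off between a transfer and a new activity


def transfer_times(taps, max_gap=MAX_TRANSFER_MIN):
    """Return {(from_route, to_route): [gap_minutes, ...]} for linked legs."""
    by_card = defaultdict(list)
    for card, route, board, alight in taps:
        by_card[card].append((board, alight, route))

    gaps = defaultdict(list)
    for legs in by_card.values():
        legs.sort()  # chronological order of legs for one card
        # Compare each leg with the next leg made on the same card.
        for (_, alight, r1), (board, _, r2) in zip(legs, legs[1:]):
            gap = board - alight
            if r1 != r2 and 0 <= gap <= max_gap:
                gaps[(r1, r2)].append(gap)
    return dict(gaps)


def mean_transfer_times(taps):
    """Average transfer time per route pair, in minutes."""
    return {pair: mean(g) for pair, g in transfer_times(taps).items()}


if __name__ == "__main__":
    for pair, avg in mean_transfer_times(TAPS).items():
        print(pair, avg)
```

Route pairs with long average transfer times would be natural candidates for timetable coordination; in tap-in-only systems the alighting time would first have to be inferred, which this sketch does not attempt.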

Existing algorithms and models in contemporary studies for predicting origin, destination, and route choice have many applications and a substantial level of accuracy. However, the development of these models involved multiple steps and relied on many assumptions and sampling approximations. While the validity of these models depends on the accuracy of these assumptions and sampling approximations, very few attempts have been made to validate them in a context different from the training/estimation context53. Therefore, future research is needed to propose new techniques to validate the underlying assumptions in PT modeling using big data and potentially provide insights into their forecasting performance.

Further, though it is possible to combine multiple big data sources, or big and small data, for PT planning (e.g., smart card data with mobile phone data), only a few studies have focused on this aspect, and only on a limited scale. There is potential to develop wide-scale models combining land use, weather, events, and other big data sources to understand travel behavior in different landscapes. While various sources of big data exist, methods to integrate these different types of data to solve problems (e.g., congestion and service deficiency), which could improve the accuracy of predicting individual travel behavior, are still lacking. Existing sources of big data are often criticized for not providing information such as the socio-demographic characteristics of passengers. Integrating data collected from small-scale surveys with the big data sources and correcting the potential biases (as proposed by Bwambale et al.15 in the context of generic travel behavior) can provide more insights into travel behavior.

In addition, cross-cutting research is needed to explain the applicability of big data in the extended domain of PT, such as the spatio-temporal relationship between origin and destination choice, transit users’ transfer choice, and mode choice behavior19. Besides, the currently disjointed research on O–D estimation, route choice, transfer choice, and mode choice can be integrated to understand complex trip chains and travel behavior25. Trip-chaining and linked trip analysis can be further extended to infer trip purpose, analyze spatial and temporal travel patterns, and analyze the route choice behavior of passengers32, 39. To improve the performance of PT service, research is needed to determine the relationship between passenger travel demand and performance indicators such as vehicle speed, quality of service, accessibility to PT, and passenger waiting time22, 43, 64.

In understanding travel behavior (individual/aggregated), clustering techniques such as density-based spatial clustering, K-means++ clustering, and Gaussian mixture models are generally applied. However, supervised classification, interpretation, and validation with onboard data are needed to classify heterogeneous transit users42. Moreover, to understand activity-based travel behavior in PT use, none of the studies has either used big data to develop an agent-based individual model or implemented a model in an open-source agent-based micro-simulation tool5. Therefore, advanced mathematical models, machine learning toolkits, and spatial statistics integrated with spatial analysis are needed to understand the interaction of PT users with their surroundings. Since big data has the potential to provide both short- and long-term records, future research can evaluate alternative scenarios of different PT policy environments (e.g., before and after policy implementation), which will make it possible to understand the potential impacts of a PT policy4, 22.
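To make one of the clustering techniques named above concrete, the following Python sketch applies K-means with K-means++ seeding to a handful of invented passenger profiles. It is not the procedure of any reviewed study: the two features (mean first-boarding hour, trips per weekday), the sample values, and the function names are assumptions chosen for illustration, and real applications would normally standardize the features first.

```python
import random

# Hypothetical per-card profiles: (mean first-boarding hour, trips per weekday).
PROFILES = [
    (7.5, 2.0), (8.0, 2.2), (7.8, 1.9), (8.2, 2.1),      # commuter-like cards
    (13.0, 0.3), (13.5, 0.5), (12.8, 0.4), (13.2, 0.2),  # occasional users
]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seeds(points, k, rng):
    """K-means++ seeding: first seed uniformly at random, each later seed with
    probability proportional to squared distance from the nearest chosen seed.
    Assumes the data contain at least k distinct points."""
    seeds = [rng.choice(points)]
    while len(seeds) < k:
        weights = [min(sq_dist(p, s) for s in seeds) for p in points]
        seeds.append(rng.choices(points, weights=weights, k=1)[0])
    return seeds


def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd's iterations started from K-means++ seeds.
    Returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = kmeanspp_seeds(points, k, rng)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


if __name__ == "__main__":
    labels, centroids = kmeans(PROFILES, k=2)
    print(labels)
    print(centroids)
```

The resulting clusters are unlabeled groups; assigning them a meaning such as "commuter" still requires the interpretation and validation against onboard data discussed above.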

Finally, apart from a few of the reviewed papers, big data has been used in transport research predominantly in the context of developed countries. Similar uses of such data in the planning and operation of PT systems are needed in developing countries, where the PT landscape is changing more rapidly.

It is expected that the accuracy and precision of PT data will improve over time, leading to fewer missing data and fewer gaps in trajectories. The improved data quality holds the promise of more robust PT planning models. A more promising direction is, however, the emergence of multimodal data (e.g., data from ride-sharing modes and shared bikes). These emerging data sources are promising for PT planning in two respects. Firstly, such data can be leveraged to optimize the transport network as a whole rather than PT alone. Secondly, they can be used to infer latent PT demand and to take PT planning measures that maximize revenue.

This review could be a useful guide for fellow researchers who intend to work with “big data” and “PT planning”, and could thereby contribute to promoting PT. Despite the ever-increasing demand for car use, academic research with big data will hopefully provide useful guidelines on how to reduce car use given the current state of PT usage. Hence, exploring new methods and techniques is essential to employ big data in accurately explaining travel behavior and improving PT systems.