1 Introduction

Taxi transport is an indispensable mode of transport in large cities. It complements other public transport modes in terms of flexible door-to-door services and 24/7 operations. The significance of the taxi industry is also reflected in its fleet size and the number of passengers served (Qian & Ukkusuri, 2015). By the end of 2020, for example, approximately 72,000 taxicabs were generating over 331 million trips in Beijing. However, balancing passenger demand and taxicab supply is a critical challenge for the taxi industry. Oversupply might lead to unwanted vacant trips and introduce negative externalities such as additional congestion and emissions (Çetin & Yasin Eryigit, 2011). In contrast, an inadequate supply may result in a lengthy wait time for passengers (Schaller, 2005). Understanding the spatial and temporal dynamics of taxi usage patterns is increasingly important as on-demand transport services grow in popularity and smart transportation becomes available (Wang & Ross, 2018).

In previous studies, two streams of investigation into usage patterns by transport mode have evolved. One stream compares mode choice among different transport modes, such as public transit, private cars, taxis, bicycles, and walking, using questionnaire survey data (Buehler, 2011; Du et al., 2021). The other stream compares usage patterns and ridership across space (Kang & Qin, 2016; Zhong et al., 2015). With the support of trip records, such as smart card data and taxi trajectory data, the spatial distribution of ridership has been examined along with the influencing factors. Both between-modal and within-modal comparisons are important for understanding the usage patterns of transport modes and the spatial distribution of ridership, because together they account for the trade-off with other travel modes when residents take taxis and indicate the irreplaceability of taxis. In addition, the temporal dynamics of taxi usage at a micro scale, such as hour by hour, are a critical attribute because the spatial patterns of taxi usage vary by hour. However, the current literature has not sufficiently examined usage patterns with both between-modal and within-modal comparisons from fine spatial and temporal perspectives. This might be due to the lack of fine-grained data from multiple sources, whose combined use is a useful methodology for such comparisons.

Recent studies have shown that sampling biases among multi-source data can be used to investigate taxi usage at a finer spatial and temporal resolution (Wang, Huang, & Du, 2020). Therefore, the research objectives of this study can be summarized as follows: (1) we propose an indicator, the taxi usage index, to measure taxi usage using GPS-traced taxi trips and mobile signaling data for the same week in January 2017; (2) using multi-source data in Beijing, we measure the dynamics of the taxi usage index by hour in a daily cycle in 1 km × 1 km grid cells and explore the influencing factors associated with taxi usage from a spatiotemporal perspective.

2 Literature review

2.1 Taxi usage and mobility patterns

There are two streams of research on taxi usage. The first is the between-modal or within-modal comparison of ridership among different transport modes, such as public transit, private cars, bicycles, and walking (Böcker et al., 2017; Buehler, 2011; Du et al., 2021). Under this stream, mode-choice prediction has received considerable research attention because it is an important component in travel demand estimation, transport planning, and urban planning (Yang et al., 2019). Traditionally, information on choice of travel mode has mainly been collected from questionnaires. For example, Schwanen and Mokhtarian (2005) measured the travel mode choice of populations in the San Francisco Bay Area by asking 8000 randomly selected respondents about “the frequency of travel by a certain mode and trip purpose.” Du et al. (2021) collected data on travel mode choices to seek healthcare by conducting a health-seeking behavior survey of 915 patients at nine top-tier hospitals in Beijing between August 1st and 17th, 2019. These studies used questionnaire data to capture static travel mode choices at the individual level.

The second stream examines the spatial dynamics of taxi usage patterns or taxi ridership. Qian and Ukkusuri (2015) explored the spatial variation of taxi ridership using large-scale New York City taxi data and associated taxi ridership with various socio-demographic and built-environment variables. Similar to taxi ridership, public transit ridership has also been studied by comparison over space. Bachir et al. (2019) proposed a method to infer dynamic origin-destination flows by transport mode using mobile phone data and observed different mobility patterns, by driving and by rail, between Paris and its suburbs. In addition, Liu et al. (2012) investigated the variations in both pick-ups and drop-offs using a taxi trajectory dataset collected in Shanghai and reported the underlying location difference, which reflects land usage patterns across the study area. In the era of big data, the usage of a transport mode by location can be evaluated using trip records from taxis, buses, and subways (Huang et al., 2019; Huang et al., 2021; Liu et al., 2021; Zhu et al., 2022).

Moreover, mobile phone data are widely used to investigate mobility patterns of the general population, and some achievements in this area have been made to date (Calabrese et al., 2013; Wesolowski et al., 2014). For example, Ratti et al. (2006) mapped the dynamics of urban activity in the metropolitan area of Milan, Italy using the Erlang values of cell phone stations. Candia et al. (2008) investigated calling activity patterns at the individual level and showed that the interevent time of consecutive calls is heavy-tailed. Yuan et al. (2012) provided a deeper understanding of how mobile phone usage correlates with individual travel behavior by exploring the correlation between mobile phone call frequencies and three indicators of travel behavior: radius, eccentricity, and entropy. Compared with transport records, such as taxi trajectory and smart card data, mobile phone data can cover a wider group of people and thus represent the mobility pattern of the general population.

2.2 Factors associated with taxi usage

Factors associated with mode choice can be classified into three types: alternative transportation (e.g., private car, taxi, public transit); socioeconomic characteristics (e.g., income, age, employment); and built environment factors (e.g., road density, land use density, land use diversity) (Farinloye et al., 2019; Gan et al., 2020; Wang et al., 2022). First, public transit ridership and mode choice are closely related to the accessibility of alternative transportation (Du et al., 2021). For example, in a case study in Beijing, Wang et al. (2022) reported that the temporal and spatial dynamics of bus ridership are closely related to alternative transportation. Second, socioeconomic characteristics have been evaluated based on population, employment, and average household income, which usually have a positive correlation with public transit/taxi ridership (He et al., 2019; Zhao et al., 2013; Zhu et al., 2018). Among these factors, population is more closely related to the amount of travel demand, and employment can significantly influence taxi usage patterns. In addition, the average household income reflects affordability and affects taxi ridership. As household income data involve privacy concerns, house prices in the corresponding area are often used as proxies to reflect economic status (Freeman, 2018; Gan et al., 2020). Third, in terms of built environment factors, previous studies have primarily focused on three dimensions: density, diversity, and design (Cervero & Kockelman, 1997). In addition, road and land use densities, land use diversity, and land use types, such as residential, commercial, and industrial areas, are all useful variables in explaining taxi/public transit ridership and mode choice (Cao et al., 2009). As land use type is difficult to quantify directly in most cities, the density of facilities, such as retailers, schools, banks, hospitals, hotels, and restaurants, is widely used as a proxy for land use type (He et al., 2019).

In summary, previous studies have mainly focused on factors associated with the spatial distribution of taxi ridership. Little attention has been paid to the factors associated with temporal dynamics. In this study, taxi usage patterns were investigated through a comparison over space and across different transportation modes, and influencing factors were examined by hour in a daily cycle. The explanatory factors considered in this study are drawn from those associated with taxi ridership in the reviewed literature.

2.3 Methods to explore factors related to taxi usage

Regarding the research methods, global models, including multiple linear regression with ordinary least squares (Chow et al., 2006), Poisson regression models (Kuby et al., 2004), aggregate logit models (Taylor & Fink, 2013), and Tobit regression models (Chiou et al., 2015), have been widely used to explore the determinants of public transit and taxi ridership. As the dependent variable in this study is continuous and follows a normal distribution after transformation, multiple linear models estimated by ordinary least squares were employed to investigate the factors associated with taxi usage patterns.

3 Research design

3.1 Study area and data

The study area, comprising the six central municipal districts of Beijing, is a major urban area in which most taxi trips are generated (Wang, Huang, & Du, 2020). To investigate taxi usage patterns at a fine spatial scale, we generated 1 km × 1 km grid cells that were compatible with the resolution of the data sources (Fig. 1). Taxi trip data, mobile signaling data, points of interest (POIs), and other built-environment factors were then aggregated to the same grid cells. Overall, approximately 1500 grid cells were used in the analysis.

Fig. 1

a Location of Beijing in China; b Location of the study area in Beijing; c Study area (the six central municipal districts of Beijing) and unit (1 km × 1 km grid cell)

3.1.1 Taxi trip data

Taxis are a widely used travel mode in urban settings. With the popularity of mobile phones, taxis can be booked online and ride-hailing has become common. The emergence of ride-hailing has eased the taxi shortage problem to some extent. However, as ride-hailing vehicles cannot be hailed on the street and are more likely to be associated with crime, taxis are still more popular than ride-hailing. Therefore, this research collected the trajectories of taxis, which record each taxi’s location, time, status (vacant or occupied), and speed every 10 seconds. From these continuous records, we can identify the pick-up and drop-off points of each trip and thus obtain the origins and destinations of the taxi trips. With a random sampling process that ensures representative spatial coverage and avoids confidentiality issues (Yao et al., 2019), we collected data for approximately 8.8 million taxi trips taken by 17,984 taxis during weekdays between January 9th and January 13th, 2017. It is worth noting that only taxi trips on weekdays are included in this study because residents’ trip purposes and the temporal dynamics of travel vary considerably between weekdays and weekends, and trips on weekdays are more likely to be associated with necessary activities. According to the origin and destination (OD) locations of taxi trips, the average numbers of departure and arrival trips in each cell were counted by hour. As shown in Fig. 2(a), on average, taxi trips occurred mainly between 07:00 and 21:00.
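The identification of pick-up and drop-off points from the status field can be illustrated with a minimal Python sketch. It assumes a hypothetical record layout of (timestamp in seconds, x, y, occupied flag) for a single taxi, sorted by time. The function name extract_trips and the rule that a trip starts at the first occupied record and ends at the first vacant record that follows are illustrative assumptions, not the documented procedure of this study.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (timestamp_seconds, x, y, occupied)
Record = Tuple[int, float, float, bool]
Point = Tuple[int, float, float]


def extract_trips(records: List[Record]) -> List[Dict[str, Point]]:
    """Split one taxi's time-ordered records into occupied trips.

    A vacant -> occupied change marks a pick-up (origin) and an
    occupied -> vacant change marks a drop-off (destination).
    A taxi whose first record is already occupied is treated as picking
    up at that record; a trip still open at the end is discarded.
    """
    trips: List[Dict[str, Point]] = []
    pickup = None
    was_occupied = False
    for t, x, y, occupied in records:
        if occupied and not was_occupied:
            pickup = (t, x, y)
        elif not occupied and was_occupied and pickup is not None:
            trips.append({"origin": pickup, "destination": (t, x, y)})
            pickup = None
        was_occupied = occupied
    return trips


# Toy trajectory sampled every 10 seconds (made-up values).
example_records = [
    (0, 0.0, 0.0, False),
    (10, 1.0, 1.0, True),
    (20, 2.0, 2.0, True),
    (30, 3.0, 3.0, False),
    (40, 4.0, 4.0, False),
    (50, 5.0, 5.0, True),
    (60, 6.0, 6.0, False),
]
example_trips = extract_trips(example_records)
```

In this toy trajectory the taxi completes two trips. Counting origins and destinations per grid cell and hour would then follow from the timestamps and coordinates of each trip.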

Fig. 2

Distribution of travelers by hour between January 9th and January 13th, 2017 from taxi trajectory data (a) and mobile signaling data (b)

3.1.2 Mobile signaling data (MSD)

The original MSD recorded an anonymous and unique ID, time, and location for each mobile user. To address privacy concerns, mobile user locations were aggregated into grid cells. The origin and destination grid cells of a trip can be identified when a mobile user moves between different grid cells in two adjacent time intervals. In total, we collected about 64.7 million OD pairs between January 9th and January 13th, 2017 (five weekdays) from China Unicom, one of the three largest telecommunication corporations in China. We counted the number of trips originating from each cell as well as the number of trips arriving at each cell so that each grid cell had two MSD attributes, namely, origins and destinations by hour. It is worth noting that OD pairs identified from MSD only indicate the movement of mobile users, and the mode of transport for each movement is unknown. In other words, these OD pairs represent trips for all transport modes. The general distribution of the origins by hour is presented in Fig. 2(b). The MSD origins are mainly concentrated between 08:00 and 18:00, with one morning and one evening peak identified. The difference between Fig. 2(a) and (b) indicates that quantitative gaps exist between taxi trips and MSD (trips by all modes) every hour; therefore, their spatial differences must be examined.
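The OD extraction described above can be sketched in Python as follows. The sketch assumes hypothetical observations of the form (user ID, hour, grid cell ID), that the time interval is one hour, and that both the origin and the destination of a movement are counted in the departure hour. None of these details are specified in the source, so they should be read as illustrative choices.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical observation layout: (user_id, hour, cell_id)
Observation = Tuple[str, int, str]


def od_pairs(observations: List[Observation]) -> List[Tuple[int, str, str]]:
    """Return (hour, origin_cell, destination_cell) for every user who is
    seen in different grid cells in two adjacent hourly intervals."""
    pairs: List[Tuple[int, str, str]] = []
    last_seen: Dict[str, Tuple[int, str]] = {}
    # Sorting the tuples orders records by user first, then by hour.
    for user, hour, cell in sorted(observations):
        if user in last_seen:
            prev_hour, prev_cell = last_seen[user]
            if hour == prev_hour + 1 and cell != prev_cell:
                pairs.append((prev_hour, prev_cell, cell))
        last_seen[user] = (hour, cell)
    return pairs


# Toy observations for three anonymous users (made-up values).
example_obs = [
    ("u1", 8, "A"), ("u1", 9, "B"), ("u1", 10, "B"),
    ("u2", 8, "A"), ("u2", 9, "A"), ("u2", 10, "C"),
    ("u3", 8, "B"), ("u3", 10, "A"),  # gap of two hours: not adjacent
]
example_pairs = od_pairs(example_obs)

# Two MSD attributes per grid cell and hour: origins and destinations.
origins = Counter((hour, o) for hour, o, _ in example_pairs)
destinations = Counter((hour, d) for hour, _, d in example_pairs)
```

Because the transport mode of each movement is unknown, these counts describe trips by all modes, which is what the all-mode trip ratio later relies on.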

3.1.3 Variable selection

According to the reviews on factors associated with ridership in Section 2.2, alternative transportation, socio-economic factors, and built environment variables were selected in this study. First, taxi usage is influenced by the distribution of alternative modes of transportation, such as buses and subways. Here, bus service coverage and subway service coverage in each grid are used to represent alternative transportation.

Second, taxi usage is related to socioeconomic factors, such as affordability, population density, and employment. Affordability matters for taxi usage, as the monetary cost of traveling by taxi is usually higher than that of other modes. In the absence of data on socioeconomic status, such as income, house prices are used to represent residents’ affordability. In this study, house price data were collected from the Soufang website (https://bj.fang.com/, accessed in October 2018), and the average house price for each grid cell was estimated based on the prices and locations of residences. Population refers to the population density, measured by the population size in each grid cell, which was collected from the Beijing population census (2010), the latest census data obtained by the authors. Employment refers to employment density, measured by the number of employees in each grid cell, which was collected from the Beijing Economic Census (2014).

Third, this research includes built environment factors related to land use, road areas, and facilities supporting daily life. Taxi usage is also associated with road network accessibility. Therefore, the road area in each grid was selected to represent the extent of the road network. The road network in Beijing, acquired from the OpenStreetMap website (http://www.openstreetmap.org/, accessed in October 2018), was used to calculate the length of roads in each grid cell. Land use diversity and density were calculated as the number of categories and the number of POIs in each grid, respectively. Recent studies (Gong et al., 2016) have incorporated POI data, which indicate the places that urban dwellers visit, with attributes including latitude, longitude, name, and facility category. The POI categories include restaurants, recreation (e.g., bars, KTVs, parks, and gyms), enterprises, and residences. In this study, bus service coverage, subway service coverage, enterprises, residences, recreation facilities, and health facilities were counted in each 1 km × 1 km grid cell using Baidu Maps (2018). The specific variable descriptions of the grid cells are listed in Table 1.
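As a rough illustration of how POI-based land use variables could be derived per grid cell, the Python sketch below assigns each POI to a 1 km × 1 km cell and counts POIs and distinct categories. It assumes POI coordinates are already projected to meters, and it interprets land use diversity as the number of distinct POI categories in a cell. Both points, as well as the function names, are assumptions made for illustration.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

CELL_SIZE_M = 1000.0  # 1 km x 1 km grid cells

# Hypothetical POI layout: (x_meters, y_meters, category)
Poi = Tuple[float, float, str]
Cell = Tuple[int, int]


def cell_of(x: float, y: float) -> Cell:
    """Map projected coordinates (meters) to integer grid indices."""
    return (math.floor(x / CELL_SIZE_M), math.floor(y / CELL_SIZE_M))


def poi_density_and_diversity(pois: List[Poi]) -> Dict[Cell, Tuple[int, int]]:
    """Return {cell: (number of POIs, number of distinct categories)}."""
    counts: Dict[Cell, int] = defaultdict(int)
    categories: Dict[Cell, set] = defaultdict(set)
    for x, y, category in pois:
        cell = cell_of(x, y)
        counts[cell] += 1
        categories[cell].add(category)
    return {cell: (counts[cell], len(categories[cell])) for cell in counts}


# Toy POIs (made-up coordinates and categories).
example_pois = [
    (500.0, 200.0, "restaurant"),
    (900.0, 999.0, "restaurant"),
    (100.0, 100.0, "enterprise"),
    (1500.0, 200.0, "residence"),
]
example_land_use = poi_density_and_diversity(example_pois)
```

The same cell assignment could be reused to count a single facility category, such as healthcare facilities, in each grid cell.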

Table 1 Description of influence factors

3.2 Methods

3.2.1 Taxi usage index

We propose a taxi usage index to quantify the concentration of taxi-based activities in a spatial unit (i.e., a grid cell in this study). The taxi usage index measures the usage of taxis in two dimensions. First, it compares taxi usage in a grid cell with that in other grid cells, which is a comparison in the spatial dimension. We define the taxi trip ratio (TR) as the number of taxi trips departing from a grid cell (Nte) over the total number of taxi trips departing from all grid cells studied (Ntw), indicating the relative degree of concentration. Second, the taxi usage index assesses taxi usage compared with other travel modes, as OD pairs measured by the MSD represent human movement in general. In this sense, the taxi usage index measures usage in the dimension of travel mode. For all travel modes, the trip ratio (AR) is given by the number of trips departing from the grid cell (Ne) over the total number of trips departing from all the grid cells studied (Nw). AR was designed to reveal the distribution of trip departures across grid cells. The taxi usage index is given by the following equation.

$$Taxi\ usage\ index= TR/ AR$$
(1)

where TR = Nte/Ntw and AR = Ne/Nw. A higher taxi usage index implies a higher concentration of taxi-based activities when compared to other grid cells and other travel modes, and vice versa.
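A short Python sketch of Eq. (1) follows. The input dictionaries, the function name taxi_usage_index, and the decision to skip cells with no all-mode departures (where AR is zero and the index is undefined) are assumptions made for illustration. The formula itself follows the definitions TR = Nte/Ntw and AR = Ne/Nw.

```python
from typing import Dict


def taxi_usage_index(
    taxi_departures: Dict[str, int],
    all_departures: Dict[str, int],
) -> Dict[str, float]:
    """Compute TR / AR for every grid cell.

    taxi_departures: Nte per cell (taxi trips departing from the cell)
    all_departures:  Ne per cell (all-mode trips departing from the cell)
    """
    n_tw = sum(taxi_departures.values())  # Ntw: taxi trips over all cells
    n_w = sum(all_departures.values())    # Nw: all-mode trips over all cells
    if n_tw == 0 or n_w == 0:
        raise ValueError("totals must be positive")
    index: Dict[str, float] = {}
    for cell, n_e in all_departures.items():
        if n_e == 0:
            continue  # AR = 0: the index is undefined for this cell
        tr = taxi_departures.get(cell, 0) / n_tw
        ar = n_e / n_w
        index[cell] = tr / ar
    return index


# Toy counts for one hour (made-up numbers).
example_taxi = {"A": 30, "B": 10, "C": 10}
example_all = {"A": 200, "B": 200, "C": 100, "D": 0}
example_index = taxi_usage_index(example_taxi, example_all)
```

In this toy case, cell A holds 60% of taxi departures but only 40% of all-mode departures, so its index is about 1.5. Cell B has an index of about 0.5 and cell C about 1.0. A value above 1 therefore means that a cell's share of taxi departures exceeds its share of departures by all modes.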

3.2.2 Multiple linear regression

Multiple linear regression is a statistical technique that uses several explanatory variables to predict the outcome of a response variable. The goal is to model the linear relationship between the independent and dependent variables. Here, we fit a multiple regression with a stepwise selection method to model the relationship between the taxi usage index and the explanatory variables:

$${I}_c={\beta}_0+\sum_{i}{\beta}_i{X}_i$$
(2)

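where Ic is the taxi usage index of grid cell c, β0 is the intercept, βi is the partial regression coefficient of the ith explanatory variable, and Xi denotes the explanatory variables listed in Table 1.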
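To make Eq. (2) concrete, the following Python sketch estimates the coefficients by ordinary least squares through the normal equations, using only the standard library. It is a simplified illustration: it does not implement the stepwise selection used in this study, the function names are hypothetical, and the data are made up.

```python
from typing import List


def solve_linear_system(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (perfectly collinear variables?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(X: List[List[float]], y: List[float]) -> List[float]:
    """Return [beta_0, beta_1, ..., beta_k] minimizing squared residuals."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y_i for r, y_i in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from I = 1 + 2*X1 - 3*X2.
example_X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 5.0]]
example_y = [1.0 + 2.0 * x1 - 3.0 * x2 for x1, x2 in example_X]
example_beta = fit_ols(example_X, example_y)
```

Since the toy response is an exact linear function of the two predictors, the fitted coefficients recover the intercept and slopes up to rounding error. With real data, an established statistics package would normally be preferred because it also reports standard errors and significance levels.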

Before modeling, multicollinearity issues were examined by calculating the variance inflation factor between independent variables. We found that population, employment, land use density, land use diversity, public facilities, living facilities, government, sports and leisure, transportation hub, and financial area variables had multicollinearity issues with other variables, and were thus removed.
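The multicollinearity check can be sketched in Python using a standard identity: the variance inflation factor of variable j equals the jth diagonal element of the inverse of the correlation matrix of the explanatory variables. The function names and data below are hypothetical, and the cutoff above which a variable is dropped is not stated in the source, so it is left to the analyst.

```python
import math
from typing import List


def correlation_matrix(columns: List[List[float]]) -> List[List[float]]:
    """Pearson correlation matrix; each inner list is one variable.
    A constant column has zero variance and raises ZeroDivisionError."""
    n = len(columns[0])
    scaled = []
    for col in columns:
        mean = sum(col) / n
        dev = [v - mean for v in col]
        norm = math.sqrt(sum(d * d for d in dev))
        scaled.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in scaled] for ci in scaled]


def invert(matrix: List[List[float]]) -> List[List[float]]:
    """Invert a square matrix by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("perfectly collinear variables")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def variance_inflation_factors(columns: List[List[float]]) -> List[float]:
    """VIF_j is the j-th diagonal entry of the inverse correlation matrix."""
    inv = invert(correlation_matrix(columns))
    return [inv[j][j] for j in range(len(columns))]


# Made-up variables: the first pair has correlation 0.8, the second pair 0.
vif_correlated = variance_inflation_factors(
    [[1.0, 2.0, 3.0, 4.0, 5.0], [2.0, 1.0, 4.0, 3.0, 5.0]])
vif_uncorrelated = variance_inflation_factors(
    [[1.0, 2.0, 3.0, 4.0], [1.0, -1.0, -1.0, 1.0]])
```

For two variables with correlation r, the factor reduces to 1/(1 − r²), so the correlated pair yields about 2.78 for each variable and the uncorrelated pair yields 1.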

4 Results

In this section, we first explore the spatial pattern and spatiotemporal dynamics of the taxi usage index by calculating and visualizing it in 1 km × 1 km grid cells for each of the 24 hours in a day. The number of trips from taxi trip data and mobile signaling data for each hour was calculated as the average from January 9th to January 13th, 2017. Then, an overall model and 24 hourly multiple linear models were constructed and compared to examine the factors associated with the taxi usage index. Even the smallest hourly number of taxi trips, at 2:00 am, is approximately 40,000, so all 24 models are supported by adequate trips and data records.

4.1 Spatial and temporal dynamics of taxi usage patterns

The spatial patterns of the taxi usage index by hour are shown in Fig. 3. Overall, the taxi usage index in the core areas was always higher than that in the periphery. This can be explained as follows: first, taxi services are convenient in core areas with more daily travel and higher taxi demand. Second, travelers in core areas are advantaged in terms of socioeconomic status, and are more likely to afford expensive travel costs (Huang et al., 2018).

Fig. 3

Spatial patterns of the taxi usage index by hour

Temporal dynamics can be observed in the spatial patterns of the taxi usage index, and three periods in a day (24 hours) can be delineated. The first period was from 11:00 pm to 5:00 am the following day. During this period, Sanlitun, Tuanjiehu, Beixiaguan, Shuangjing, Niujie, Hepingli, and Zhanlanlu subdistricts showed a higher taxi usage index, which might be related to overtime journeys and recreational activities. The second period lasted from 5:00 am to 8:00 am. Grid cells with a high taxi usage index were relatively concentrated. They were mainly located in Wangjing, Taiyanggong, Hepingjie, Hepingli, Anzhen, Desheng, Xiangheyuan, Zuojiazhuang, Sanlitun, Maizidian, and Tuanjiehu subdistricts. The third period was from 8:00 am to 2:00 pm. The grid cells with a high taxi usage index were mainly concentrated in Zuojiazhuang, Sanlitun, Dongzhimen, Maizidian, Tuanjiehu, Hujialou, Jianwai, Jianguomen, and Chaowai subdistricts. During the rest of the time, grid cells with a high taxi usage index were relatively dispersed.

Rapid urbanization has led to a mixed distribution of facilities, making it difficult to distinguish among those that affect taxi usage. Several multiple linear regressions were employed to investigate the factors associated with the temporal dynamics of taxi usage.

4.2 Multiple linear regression results

Overall, the independent variables performed well in explaining the taxi usage index, as the pseudo-R2 values of the overall model and the 24 hourly models were all over 0.4. Therefore, we summarized the results and discussed the regression coefficients of the factors associated with taxi usage (Tables 2 and 3).

Table 2 Results of Multiple Linear Regression on the origin side
Table 3 Coefficients of multiple linear regression in different hours

4.2.1 Multiple linear regression results for the overall taxi usage patterns

The results indicated that house price, road density, bus service coverage, subway service coverage, residence, and healthcare facilities are significantly associated with taxi usage patterns. Among these, house price is positively correlated with taxi usage patterns. This might be because taxis are more expensive than other transport modes. Travelers located in areas with high house prices are usually economically advantaged (Freeman, 2018); thus, they are more likely to afford and choose taxis. Road density has a positive association with taxi usage. This is because road density increases with road connectivity (Mo et al., 2018) and taxis are more likely to operate on roads with high connectivity. Bus and subway service coverage have a negative association with taxi usage patterns. As buses, subways, and taxis are three travel modes in urban transportation, public transit and taxis provide alternatives for passengers and share passenger flow with each other (Wang et al., 2022). We also find that enterprises and residences are positively associated with taxi usage patterns. An explanation for this finding is that areas with more enterprises and residences often have a more concentrated population, and thus may induce more travel demand overall (Liu et al., 2012). Furthermore, healthcare facilities are positively correlated with taxi usage patterns, which might also be related to the large travel demand for activities in these facilities. However, schools, shopping, catering, and recreational facilities, and tourism spots had no significant correlation with taxi usage patterns. These findings imply that taxi usage may vary depending on travel purposes.

Among the indicators significantly associated with the taxi usage index, the coefficient of road density is the largest, followed by house price, bus service coverage, and subway service coverage. This might be because taxis tend to cruise where roads are dense, thus presenting a larger taxi usage index. Since taxis are an expensive mode of transport, the economic status of individuals in different areas also contributes to taxi ridership. In contrast, taxi usage is negatively correlated with bus and subway service coverage. In addition, there is a weaker correlation between the density of other facilities and the taxi usage index. Among the facilities studied, the correlation between residence and the taxi usage index was the largest, followed by enterprises and healthcare facilities.

4.2.2 Temporal dynamics of regression coefficients

In general, the effects of bus service coverage, subway service coverage, enterprises, healthcare facilities, shopping spots, catering facilities, recreational facilities, and tourist spots showed significant temporal dynamics. Enterprises are positively associated with taxi usage patterns from 8:00 am to 5:00 am the next day, which may be induced by business trips from enterprises to other facilities. From 9:00 am to 7:00 pm, the larger the number of healthcare facilities, the higher the taxi usage index. An explanation might be that taxis, like private cars, are more convenient and comfortable than other transport modes, such as public transit and bicycles. Hence, for patients with physical discomfort, the taxi is a preferred transport mode. Furthermore, taxis are also a better choice than private cars because most well-known hospitals in Beijing are located in the inner city where parking spaces are lacking (Wang, Du, et al., 2020). However, healthcare facilities had no significant correlation with the taxi usage index during their off-work hours (from 6:00 pm to 8:00 am the next day).

The recreational facility variable is positively related to the taxi usage index from 7:00 pm to 12:00 am, whereas it is not significantly related to the taxi usage index during the remaining hours. This is reasonable because people are more likely to visit recreational facilities during off-work hours on weekdays. The fact that many recreational trips in Beijing are served by taxis can be explained by the following reasons. First, recreational facilities are usually located in areas with a high development density and provide limited parking spaces. Second, activities in recreational facilities (e.g., bars) usually involve alcohol consumption, which increases taxi demand owing to drunk-driving restrictions in China. There was a positive correlation between tourist spots and taxi usage patterns from 9:00 am to 8:00 pm, which is related to the large number of tourists during this time.

However, the associations of house price, road density, residence, and school with taxi usage patterns do not show obvious temporal dynamics. Among these, house price and road density are statistically significant at almost all hours. It can be reasoned that the economic status of individuals always influences whether they choose taxis. There was no significant association between schools and taxi usage patterns at any hour. This suggests that taxis are not the main mode of transportation for parents taking their children to and from school.

As we conducted the multiple linear regression analysis for each of the 24 hours in a one-day cycle, the coefficients of the independent variables at each hour were compared with one another. Thus, we can find the factors that have the largest correlation with the taxi usage index. The results indicated that the factors most associated with the taxi usage index varied across periods. From 10:00 am to 6:00 pm, the coefficients of house price and road density were the largest. From 6:00 pm to 10:00 am the next day, the correlation between residence and the taxi usage index was the largest, which is probably because this is when residents are at home and thus depart from home to other locations.

5 Conclusions and discussion

Comparisons among different transport modes over space offer significant dimensions for understanding taxi usage patterns, namely travel mode choice and transport ridership. However, the existing literature focuses only on one of these. Regarding related factors, existing studies have focused on those associated with transport ridership and travel mode choice, leaving the factors associated with the temporal dynamics of taxi usage patterns unexplored. To fill these gaps, we proposed a new indicator to measure taxi usage in a grid of 1 km × 1 km cells in Beijing, utilizing multi-source data harvested in the same week. Moreover, we examined the spatiotemporal dynamics of the taxi usage index in a 24-hour cycle and explained the driving factors behind these dynamics with alternative transport, socioeconomic factors, and built environment factors. Overall, this research offers an example of utilizing big data in the study of travel behavior with a novel research framework. The calculation method for the taxi usage index can also be extended to other cities and cases.

Some policy implications can be drawn from this study. First, taxi ridership should not be the sole consideration when allocating taxi services, and the interaction of taxi usage with other travel modes should be considered in transportation management. Second, the temporal dynamics of the factors influencing taxi usage should be included in taxi ridership sourcing. Factors such as bus service coverage, subway service coverage, enterprise, healthcare facility, shopping spot, catering facility, recreation facility, and tourist spot had different associations with taxi usage patterns at different times, while others such as house price, road density, residence, and school did not. These findings imply that we should consider the influence of the built environment differently across the day; specifically, taxis can cruise around residences all day, while more taxi services are advised to concentrate on enterprises, healthcare facilities, and tourism spots in the daytime. In addition, more taxi services should be provided in areas with recreation facilities, catering facilities, and shopping spots in the evening.

This study has several limitations. First, the POI data could not reflect the scale of each point. Hence, investigation of the density and areas of the facilities was limited. In addition, the Beijing census data were collected in 2010, which has a 7-year lag with the taxi trip data and mobile signaling data. However, it is the most recent data that we could collect, as the census in China is carried out every 10 years. In future work, more evidence on the spatiotemporal dynamics of the taxi usage index and its related factors will be explored and compared to summarize and extract general findings. In addition, taxi usage often presents significantly different patterns on weekdays and weekends. This study mainly focuses on the temporal dynamics of taxi usage on weekdays, and the disparity in taxi usage between weekdays and weekends will be further explored.