1 Introduction

Public bus service plays an important and irreplaceable role in the traffic system of a city. On one hand, public bus is one of the most convenient and cost-efficient ways for people to travel. On the other hand, public bus service, as an alternative to the use of private cars, is an effective way to reduce carbon dioxide emissions. Promoting the bus service not only provides convenience for people but it also improves the urban living conditions and helps in the fight against the climate change. On-time bus is an emerging bus running mode where the bus arrives at each bus stop according to the time table strictly, which is helpful for passengers to avoid wasting too much time at bus stops. There are several countries, e.g., the USA, Finland, and Japan, where efforts have been made to implement on-time bus running mode to improve the bus service quality.

Clearly, a reasonable design of bus running routes and time tables is the key to carry out the on-time bus running mode. To initiate this development, practical bus journey data including the arrival information of a bus at every stop should be collected at first. The developments of sensors and Internet of Things enable the automatic collection of bus journey data. For example, the local government of Tampere, Finland, started in 2015 to record and publish as open data the location information of each running bus real-time every second. Particularly, the essential bus journey data include information on, for each running bus at each bus stop location, whether the bus was on schedule at a stop and how long was the delay if it was not on time. Table 1 lists several samples of Tampere bus journey data. There, each record contains the information of a bus arriving at a bus stop.

Example 1

In Table 1, “Stop” represents the bus stop ID, “Name” is the name of the bus stop, “Latitude” and “Longitude” indicate the location of a bus stop, “Line” is the bus line number, “Date” and “Arrival” are the date and time of bus arrival, respectively, and “Delay” is the difference between the arrival time and the scheduled time (‘−’ indicates that bus arrives ahead of scheduled time). For instance, from the first record in Table 1, we can see that a bus on Line 13 arrived at Stop Ammattikoulu (ID: 1500, location: 61.4986\(^{\prime \prime }\)N, 23.7355\(^{\prime \prime }\)E) at 16:40:15 PM on August 3, 2016, being 210 s behind the schedule.

Table 1 Instances of bus journey record

A bus non-on-time query provides statistics on the non-on-time arrivals of buses that happened under given conditions. Bus non-on-time queries are useful for the bus route design, online information systems on bus transportation, and the bus running schedule optimization. In general, the non-on-time events can be divided into the early case (the arrival time is ahead of the scheduled time) and the delaying case (the arrival time is later than the scheduled time). Please note that the main concern of bus non-on-time query is the delaying case, since (i) the early case can be easily avoided by slowing or stopping buses in the real-life situation, and (ii) the techniques supporting the query for delaying cases can be used for early cases. Thus, for the sake of simplicity, we just discuss the delaying case hereafter. However, the results generalize to the early case.

Typical bus non-on-time queries [15, 18] include: “How many bus delaying events happened over a given time period?”, “Which are the routes where most delaying events happen?”, and “Do the bus delaying events aggregate and where?”, etc. Based on the query conditions, there are three kinds of bus non-on-time queries:

  • Temporal queries, in which the query conditions are related to bus running time. For example, how many buses were delayed during \(17:00 \sim 19:00\)?

  • Spatial queries, in which the query conditions are related to bus stop locations. For example, given a bus stop ID “560”, which are the nearest neighboring bus stops where delaying events happen?

  • Spatial–temporal queries, in which the query conditions are related to both bus running time and bus stop locations. For example, which are the spatial–temporal zones with significant aggregation of delaying events?

To the best of our knowledge, there is no data management model dealing with the bus non-on-time queries.Footnote 1

Since non-on-time query over bus journey data can be used to improve bus service, thus improving comfortable travel experience, we should provide an efficient model to manage bus journey data. However, there are some challenges that we need to address: (i) How to manage the bus journey data? (ii) How to fuse data from different sources? (iii) How to perform queries efficiently? In this paper, to tackle these challenges related to non-on-time queries, we propose a model named Bus-OLAP to manage bus journey data.

In this paper, we make the following concrete contributions: (i) We build indexes of bus journey data by bit-vectors according to the application scenarios, taking into account particularly the temporal and spatial factors. (ii) We introduce efficient index operations to support the non-on-time queries on bus journey data. (iii) We implement the related computations on Spark to support distributed computing for the processing of the queries, thus enabling real-time response for the analysis for massive datasets.

The rest of the paper is organized as follows. We review the related work in Sect. 2 and present the critical techniques of Bus-OLAP in Sect. 3. In Sect. 4, we report a systematic empirical study using real-world bus data and we conclude the paper in Sect. 5.

2 Related Work

Our study is related to previous studies on urban traffic analysis, spatial–temporal data management, and parallel computing.

2.1 Urban Traffic Analysis

The analysis on urban traffic plays an important role and attracts extensive attention from public [25]. Traffic analysis helps to alleviate urban traffic congestion thereby improving environmentally friendly traffic for people to travel. Traffic analysis has been recently under active research. Yuan et al. [22] analyze the movements of taxis and passengers and provide an easy way for passengers to pick taxis. Pang et al. [10] detect the anomalous behavior of taxis in Beijing metropolitan area. Chen et al. [2] propose a model given the past trajectories and predict the next station of an object. Kong et al. [8] aim to deal with the problem of traffic congestion and propose a method to predict the traffic congestion using the floating car trajectories data. Han and Moutarde [6] focus on traffic dynamic prediction in urban transportation network and find out the evolution of traffic states, which contributes to the adjustment of traffic management. Zhang et al. [24] apply deep learning to predict the traffic of crowds in different regions of a city. Bao et al. [1] propose a data-driven approach to develop bike lane construction plans satisfying the constraints and objectives requested by the user. There are several works on predicting the urban traffic flow by the traffic state [4, 13, 17]. Wu et al. [18] detect the temporal–spatial aggregation of bus delay but do not consider the management of bus data.

To the best of our knowledge, the existing work about traffic analysis concentrates on the problems with vehicles with stochastic trajectories. However, the analysis on bus data is essentially different from on other vehicles. On one hand, the trajectory and schedule of each bus are fixed. On the other hand, the main concern on bus journey data is non-on-time analysis. Clearly, the existing methods proposed by previous work on traffic analysis are not applicable to online bus journey data analysis, and, in particular, they do not give an efficient method to manage large bus data sets for efficient online analysis.

We tackled the problem of bus journey data management in Pang et al. [11], a preliminary version of this paper. In comparison with that work, we have now extended the paper in the following aspects: (i) adding more introduction to the background of the research on bus journey data collection and analysis; (ii) including the necessary background knowledge about bus journey data preprocessing in Sect. 3.1; (iii) presenting a method of querying data from different sources; (iv) providing a more detailed description of key techniques for bus non-on-time query in Bus-OLAP; and (v) performing more extensive empirical evaluations.

2.2 Spatial–Temporal Data Management

Urban traffic data have spatial–temporal characteristics. The bus data management can improve the efficiency of storage, indexing, and querying of the data. Sistla et al. [12] propose a model, called MOST, to model the moving object with a time-related function, which improves the efficiency of storage and querying of moving objects in a database. Ting et al. [16] present a simplistic network model for moving objects. Some index structures such as R-tree [5] and B\(^+\)-tree [7] are also used to optimize spatial–temporal queries. Yu et al. [21] propose an algorithm, named YPK-CNN, to monitor KNN queries.

The works discussed above can improve the query efficiency by different ways. However, we want to provide a model by which all bus running related factors can be considered. For example, besides the factors listed in Table 1, some external factors, such as weather activity and temperature, also affect the on-time rate of bus service. In particular, we need a data management model that is suitable for a distributed computing environment.

2.3 Parallel Computing

Considering the requirement of analyzing large-scale bus journey data, parallel computing techniques are necessary to speed up the efficiency of spatial–temporal analysis [9]. A lot of work has been done on adapting parallel computing methods to support efficient queries over large-scale data. Xia et al. [19] design a method to conduct KNN queries based on Map-Reduce and apply it to real-time prediction of traffic flow. Eldawy et al. [3] have built a system, named GeoSpark, based on Spark to improve the efficiency of spatial–temporal data analysis. Xie et al. [20] apply R-tree indexing to parallel queries based on Spark, to deal with the efficiency issues of large-scale data.

In this work, we use Spark to build a parallel processing model based on bit-vector operations to conduct bus non-on-time queries.

3 Design of Bus-OLAP

In this section, we present our Bus-OLAP data management model for bus non-on-time queries. The framework of Bus-OLAP (Fig. 1) consists of data cleaning, transforming, loading, and querying. The preprocessing of bus journey data has been well studied in [14]. However, for the sake of self-completeness of our presentation, we briefly present the key points for data preprocessing here. Then, we discuss the most important techniques used in Bus-OLAP: (i) data indexing, (ii) index operations, and (iii) parallel query computation.

Fig. 1
figure 1

Framework of Bus-OLAP

3.1 Data Preprocessing

The raw bus journey data suffer from noise, inconsistencies, and missing observations [15]. This is due to the real-world situations, such as the location measurement inaccuracies of the global navigation satellite system, imprecise knowledge of the real observation time, and errors in data identification. Several methods had been used to clean the data. The noisy data can be identified as outliers by comparing with other data. The inconsistent data can be discarded with the help of known scheduled bus stop sequences. The missing data can be ignored, since there is typically enough data available for the analysis purposes.

Millions of observations are stored every day. However, the observations are not useful as such for statistical analysis. As a result, they are grouped into journeys, so that the essential bus running information, e.g., the bus route and the arrival time at each bus stop during one journey, can be studied. Furthermore, all bus journeys are segmented into links between subsequent bus stops, since the bus stop sequences and the location of each bus stop are available, and the segments between every two bus stops are useful for bus traffic analysis.

Example 2

In Fig. 2, there are two buses \(v_1\) and \(v_2\), three bus stops and four observations. Bus \(v_1\) arrived at the bus stop “Rautatieasema” at 12:00 and then arrived at “Tammelan puistokatu” 3 min later. Thus, we grouped the first and third observation together because they are in the same journey of bus \(v_1\). Similarly, the second and last observations in Fig. 2 belong to a journey of bus \(v_2\). Furthermore, the link between bus stops “Rautatieasema” and “Tammelan puistokatu” is a segment of \(v_1\)’s journey.

Fig. 2
figure 2

An example of bus journey

3.2 Data Indexing

To support efficient bus non-on-time queries, we implement an index based on attribute domain partition in Bus-OLAP. Specifically, we divide the domain of each attribute into several disjoint partitions. Then, for each attribute value of a record, we build a bit-vector to indicate the partition it belongs to. Thus, we can apply bit operations to perform bus non-on-time queries, which improves the query efficiency.

Next, we introduce the details of the index building. For attribute A, the domain of A, denoted by \({\mathcal {D}}(A)\), is the set of all available values on attribute A. A partition of \({\mathcal {D}}(A)\), denoted by \({\mathcal {P_D}}(A)\), is a collection of non-empty subsets of \({\mathcal {D}}(A)\) such that (i) \({\mathcal {D}}(A) = \bigcup \nolimits _{d_i \in {\mathcal {P_D}}(A)} d_i\), and (ii) for \(\forall d_i, d_j \in {\mathcal {P_D}}(A)\), \(i \ne j\), \(d_i \cap d_j = \emptyset\). Please note that, in Bus-OLAP, the order of elements in \({\mathcal {P_D}}(A)\) is predetermined and fixed during performing queries. We denote by \({\mathcal {P_D}}(A)_{[i]}\) the ith element in \({\mathcal {P_D}}(A)\). Furthermore, there may be several different partitions for an attribute. For example, the partitioning of \({\mathcal {D}}({\text{``Date''}})\) can be by month or by season.

For a record r, we denote by r.A the value of r on attribute A. Given a partition \({\mathcal {P_D}}(A)\) on attribute A, the index of r.A is a bit-vector \(<b_1, b_2, \ldots , b_{|{\mathcal {P_D}}(A)|}>\) satisfying

$$\begin{aligned} \begin{aligned} b_i = {\left\{ \begin{array}{ll} 1,&{} r.A \in {\mathcal {P_D}}(A)_{[i]}\\ 0,&{} r.A \notin {\mathcal {P_D}}(A)_{[i]} \end{array}\right. } \end{aligned} \end{aligned}$$
(1)
Fig. 3
figure 3

An example of a partition on attribute “Date”

Example 3

A partition on attribute “Date” according to day of the week is illustrated in Fig. 3. The domain of “Date” is partitioned into seven subdomains, corresponding to the seven days of a week. The day “2016-08-03” in Table 1 represents August 3, 2016, and it is Wednesday. Therefore, the index of first record in Table 1 is \(<0, 0, 1, 0, 0, 0, 0>\) under partition \({\mathcal {P_D}}({\text{``Date''}})\).

Table 2 Attribute domain partitions in Bus-OLAP

In Bus-OLAP, the selection of attribute partitions depends on:

  • The query requirements. For example, we partition the domain of “Date” by day of the week and the domain of “Arrival” by time interval, so that we can find stops where most delaying events occur during the morning rush hour on Monday.

  • The partition size. For example, we do not divide the domain of “Date” by date because the statistics on the non-on-time events within one day has little significance, and the space requirement following from the amount of subdomains (i.e., \(|{\mathcal {P_D}}({\text{``Date''}})|\)) would be large.

Table 2 lists the domain partition criteria for the main attributes listed in Table 1 by considering the application requirements.

For a partition \({\mathcal {P_D}}(A)\) on attribute A, the index of all records is the set \(V_A = \{v_1, v_2, \ldots , v_{|{\mathcal {P_D}}(A)|}\}\), where \(v_i \in V_A\) (\(1 \le i \le n\)) is the column vector with respect to \({\mathcal {P_D}}(A)_{[i]}\). If the value of the kth element of \(v_i\) is 1, the value of the kth record on attribute A belongs to \({\mathcal {P_D}}(A)_{[i]}\).

Example 4

In Fig. 3, the column vector corresponding to “Wed” is \(v = (1, 0, 0, 0)^\mathrm{T}\). The first element of v (i.e., 1) means that the value of the first record in Table 1 on attribute “Date” belongs to “Wed”.

As discussed above, the non-on-time query over bus journey data can be easily implemented by vector operations. For a bitmap \(V_A = \{v_1, v_2, \ldots , v_n\}\), we store \(v_i \in V_A\) as a bit-vector. However, instead of loading all indexes, Bus-OLAP only loads the indexes (i.e., bit-vectors) of query-concerned attributes into memory. Then, a query can be converted into bit-vector operations.

For a new bus journey record, the index can be efficiently updated by appending a new bit at the end of each bit-vector.

Fig. 4
figure 4

An example of attribute combination query

3.3 Index Operations

For the sake of efficiency, we introduce some operations on indexes (bit-vectors) in Bus-OLAP. For a partition \({\mathcal {P_D}}(A)\), we define the subpartition of \({\mathcal {P_D}}(A)\) as \(\hat{{\mathcal {P_D}}}(A)\), where \(\hat{\mathcal {P_D}}(A) \subseteq {\mathcal {P_D}}(A)\). The set of bit-vectors corresponding to \(\hat{\mathcal {P_D}}(A)\) is \(\hat{V}_A = \{v_i \in V_A \mid \hat{\mathcal {P_D}}(A)_{[i]} \in {\mathcal {P_D}}(A)\}\). We define the OR operation on \(\hat{V}_A\) as

$$\begin{aligned} \mathrm{OR}(\hat{V}_A) = v_1' \mid v_2' \mid \dots \mid v_m', \end{aligned}$$

where \(v_i' \in \hat{V}_A\) (\(1 \le i \le m\)) and “\(\mid\)” denotes the bitwise OR operation between every two bit-vectors.

Clearly, the result of \(\mathrm{OR}(\hat{V}_A)\) is also a bit-vector. The records satisfying subpartition \(\hat{\mathcal {P_D}}(A)\) can be obtained by index operations.

Example 5

Consider Fig. 3, and suppose the query target is finding the records generated in weekdays (from Monday to Friday). Firstly, we fetch the subpartition \(\hat{\mathcal {P_D}}({\text{``Date''}}) = \{\text {Mon}, \text {Tue}, \text {Wed}, \text {Thu}, \text {Fri}\}\) from \({\mathcal {P_D}}({\text{``Date''}})\). As a result, the bit-vector set with respect to \(\hat{\mathcal {P_D}}({\text{``Date''}})\) is \({\hat{V}_{\text{``Date''}}} = \{v_1, v_2, v_3, v_4, v_5\}\), where \(v_1 = (0, 0, 0, 0)^\mathrm{T}\), \(v_2 = (0, 0, 0, 0)^\mathrm{T}\), \(v_3 = (1, 0, 0, 0)^\mathrm{T}\), \(v_4 = (0, 1, 0, 0)^\mathrm{T}\), and \(v_5 = (0, 0, 1, 0)^\mathrm{T}\). Then, \(\mathrm{OR}({V_{\text{``Date''}}}) = v_1 \mid v_2 \mid v_3 \mid v_4 \mid v_5 = (1, 1, 1, 0)^\mathrm{T}\). Thus, the records satisfying the query conditions are the first three records in Table 1.

In addition, we define the bitwise AND operation (denoted by&) for two bit-vectors belonging to different subpartitions, so that the query involving conditions on different attributes can be performed by index operations.

Example 6

Consider the example in Fig. 4. Suppose that the query target is the set of records whose “Longitude” ranges from 23.73 to 23.74, “Latitude” ranges from 61.48 to 61.50, and “Date” is Wednesday or Thursday. Then, the subpartitions of “Longitude,” “Latitude,” and “Date” are \(\{\text {o}_1\}\), \(\{\text {a}_3, {\text {a}_4\}}\), and \(\{\text {Wed}, \text {Thu}\}\), respectively. The query result illustrated as the red area in Fig. 4 can be obtained by the following index operation: \(\mathrm{OR}(\{a_3, a_4\}) \& \mathrm{OR}(\{o_1\}) \& \mathrm{OR}(\{\text {Wed}, \text {Thu}\})\).

Fig. 5
figure 5

Framework of distributed query and computing. There are four steps: ① loading vectors into memory according to the query conditions; ② mapping the conditions into tuples \(<c_1, c_2, \dots , c_n>,\) where \(c_i\), \(0 \le i \le n\), is the query condition; ③ partitioning the data into nodes and performing the bit operations on vectors by Spark; ④ writing the output of the query result

3.4 Parallel Query Computation

Bus non-on-time query is computation intensive and response time sensitive. Figure 5 shows the framework of parallel query computation by bit-vectors using Spark [23]. Next, we introduce the details of parallel query computation by three typical kinds of queries. Please note that the queries to be discussed below are typical application scenarios for Bus-OLAP. However, it is easy to perform more complex queries using index operations with Spark. All bus non-on-time queries are performed using the framework shown in Fig. 5. Specifically, Bus-OLAP takes the query condition as input. In the query process, Bus-OLAP firstly maps the conditions into tuples. Secondly, for each condition tuple, Bus-OLAP conducts the bit operations on vectors using parallel computation with Spark. Finally, Bus-OLAP returns the results that satisfy the query conditions.

3.4.1 Temporal Queries

The typical temporal queries include querying the number of non-on-time events over a period of time and querying the routes that have the most non-on-time events over a period of time.

Consider a query on the number of non-on-time events over a period of time, given time interval t, and non-on-time condition z (ahead of time, on-time, or delayed). The corresponding condition tuple in Fig. 5 is \(<t, z>\). Please note that there may be many condition tuples due to the number of time intervals. We compute each condition tuple in a parallel, using the computational model of Spark. For each condition tuple, the query result is obtained by index operations on bit-vectors. The number of non-on-time events equals the number of bits set to 1 in the result vector.

Similarly, the query on the route that has the most non-on-time events over a period of time can be easily implemented. Given query conditions with respect to time interval t, non-on-time condition z, and bus line l, the computation on each condition tuple \(<t, z, l>\) returns the number of delaying events. Then, the query result is the bus line that has the maximum number of delaying events over all tuple results.

Fig. 6
figure 6

An example of KNN query

3.4.2 Spatial Queries

There are some typical spatial queries, such as KNN and RKNN, that can be applied in the bus data in the following way. Given a bus stop q with a number of delaying events, the analyst may want to know the bus stops close to q and analyze the factors related to the delays.

After partitioning attributes “Longitude” and “Latitude,” we get a division of space into a grid. Each bus stop is located in a grid. For a bus stop q, the main steps to find the k nearest bus stops are: (i) search the candidate bus stops by extending a q-centered rectangular space that consists of grids; (ii) once there are at least k bus stops contained in the rectangular space, calculate the distance from q to the kth nearest stop p (denoted by \({\text {dis}}(q, p)\)); (iii) find k nearest neighboring stops within a q-centered circle space with radius \({\text {dis}}(p, q)\).

Example 7

Consider bus stop q in Fig. 6. To find five bus stops nearest to q, the first step is extending the q-centered rectangle. As \(R_1\) is the first (smallest) rectangle containing 5 bus stops, and \(p_5\) is the fifth nearest bus stop to q within \(R_1\), stops located in the q-centered circle with radius \({\text {dis}}(q, p_5)\) are the candidates for the query results (5 nearest neighboring stops to q).

Considering the characteristics of index operations, it is easy to find bus stops located within a rectangle area compared to a circle one. Thus, in Bus-OLAP, for stop q in Fig. 6, we find the KNN stops of q by checking the stops located within the smallest rectangle containing the q-centered circle with radius \({\text {dis}}(q, p_5)\), i.e., \(R_2\). Please note that it is easy to find stops located in \(R_2\) by index operations based on \(R_1\). For rectangle \(R_1\) in Fig. 6, the index operation on Longitude \(\mathrm{op}_\mathrm{lon} = \mathrm{OR}(\{o_2, o_3, o_4, o_5, o_6\})\) and on Latitude \(\mathrm{op}_\mathrm{lat} = \mathrm{OR}(\{a_2, a_3, a_4, a_5, a_6\})\). After extending the rectangular area from \(R_1\) to \(R_2\), two partitions \(o_1\) and \(o_7\) are added into the Longitude of \(R_1\). Similarly, the two partitions \(a_1\) and \(a_7\) are added into the Latitude of \(R_1\). Consequently, the new result on Longitude is calculated as \(\mathrm{op}_\mathrm{lon} \mid \mathrm{OR}(\{o_1, o_7\})\) and the new result on Latitude is calculated as \(\mathrm{op}_\mathrm{lat} \mid \mathrm{OR}(\{a_1, a_7\})\), respectively. Obviously, the search area can be efficiently extended by performing index operations on bit-vectors iteratively.

The RKNN query can be used to find all bus stops whose KNN stops include the given bus stop q. We implement the query using parallel computation on Spark. In the framework of Fig. 5, the condition tuple is \(<q, p, k>,\) where q is the given bus stop. Spark is used to compute each tuple, returning the result indicating whether q is one of the KNN stops of p. The RKNN query result is a set of bus stops that satisfy the query condition.

Fig. 7
figure 7

Spark process of delay aggregation query

3.4.3 Spatial–Temporal Queries

Detecting the spatial–temporal zone of bus delay aggregation can be done by querying the zone where delay aggregation occurs in a period of time. We measure the significance of spatial–temporal aggregation delay by log-likelihood ratio (LLR). Given a zone S and a time interval T, the log-likelihood of bus delay taking place, denoted as \({\mathcal {L}}(S, T)\), is

$$\begin{aligned} \begin{aligned} {\mathcal {L}}(S, T) = {\left\{ \begin{array}{ll} {\mathcal {D}}(S, T) \times \log (r_1) + ({\mathcal {D}}(\widetilde{S}, \widetilde{T}) - {\mathcal {D}}(S, T))\times \log (r_2)\\ - {\mathcal {D}}(\widetilde{S}, \widetilde{T}) \times \log \frac{{\mathcal {D}}(\widetilde{S}, \widetilde{T})}{\mathcal {N}(\widetilde{S}, \widetilde{T})},\quad r_1 > r_2\\ 0,\quad r_1 \le r_2 \end{array}\right. } \end{aligned} \end{aligned}$$
(2)

where

$$\begin{aligned} r_1 = \frac{{\mathcal {D}}(S,T)}{\mathcal {N}(S, T)}, r_2 = \frac{{\mathcal {D}}(\widetilde{S}, \widetilde{T}) - {\mathcal {D}}(S, T)}{\mathcal {N}(\widetilde{S}, \widetilde{T}) - \mathcal {N}(S, T)} \end{aligned}$$

where \(\widetilde{S}\) and \(\widetilde{T}\) are the maximal zone and maximal time interval, respectively; S and T are the observed zone and time interval, respectively; \(\mathcal {N}(S, T)\) denotes the total number of bus arrivals within S and T; and \({\mathcal {D}}(S, T)\) denotes the number of delaying events within S and T.

Our target is to find the zone during a time interval that has maximal LLR in bus delay aggregation analysis. Figure 7 illustrates the processing of the delay aggregation query with Spark. Step 1 in Fig. 7 corresponds to Step 2 in Fig. 5, and Steps 2 and 3 in Fig. 7 correspond to Step 3 in Fig. 5. Bus-OLAP takes query conditions as input. Firstly, Step 1 joins the query conditions into condition tuples \(<o, a, d, t, de>\), where the variables denote “Longitude,” “Latitude,” “Date,” “Time,” and “Delay,” respectively. Secondly, Step 2 partitions tuples to nodes of the Spark cluster and gets the new tuples of LLR. Finally, Step 3 searches the tuple with maximal LLR, which is the query result.

3.5 Query with Data from Different Sources

The on-time rate of bus service is vulnerable to the impact of external factors. For example, the delaying events happen more frequently in rainy days and foggy days. Clearly, it is interesting and useful to consider various external factors, e.g., weather, when performing bus non-on-time analysis. To this end, we introduce a flexible implementation in Bus-OLAP such that queries containing conditions on external factors can be efficiently performed.

It is challenging to query with external factors, since the external factors are extracted from different sources of data. For example, the bus journey records shown in Table 1 do not contain any weather information. If users want to investigate the impact of weather on bus delay, weather information should be collected at first. Technically, there are two steps to support queries containing data from different sources.

  • The first step is fusing external factors with bus journey records. In other words, more attributes with respect to external factors are added to bus journey records.

  • The second step is building indexes for the external factors in order to support efficient query.

Considering that changes in weather may significantly impact the bus on-time rate, we collect weather data from Weather Underground, which is a Web site sharing weather information of global cities.Footnote 2 Next, we introduce our method to fuse external factors with bus journey records for bus non-on-time queries by taking weather data as an example.

To give an example, we collect Tampere’s local average daily temperature and daily weather activity from Weather Underground. The reasons we collect these two factors are that (i) average daily temperature and daily weather activity are two main features of weather status, and (ii) the data types of these two factors are typical. The domain of average daily temperature is continuous, while the domain of daily weather activity containing “rain,” “fog,” and “thunder” is enumerable. (Here, we only consider the weather activities that affect the bus on-time rate negatively.)

Please recall that, as listed in Table 1, the bus journey records contain temporal information (i.e., date and bus arrival time). As both average daily temperature and daily weather activity are time related, it is easy to associate each bus journey record with the average temperature and weather activity in that moment.

Example 8

On August 5, 2015, at Tampere city, the average daily temperature was \(15\,^{\circ }\hbox {C}\) and the weather was rainy. Then, for the 3rd record listed in Table 1, we fuse it with weather factors (in bold font) as follows.

Stop

Name

Lat.

Long.

Line

Date

Arrival

Delay

Temp

Activity

3084

Kuoppa maentie

61.4799

23.8046

560

2016-08-05

00:09:48

−34

15

Rain

For the sake of query efficiency, it is necessary to build indexes for the external factors. Similar to building indexes for bus journey data (Sect. 3.2), for each external factor, we build corresponding indexes based on the domain partition. For an external factor whose data type is continuous (e.g., average daily temperature), the index is built like building the index on “Delay”. Specifically, for each bus journey record, we construct a bit-vector indicating the interval that the value of average daily temperature locates in. For an external factor whose data type is enumerable (e.g., daily weather activity), the index is built like building the index on “day of the week”. Specifically, we use a set of bit-vectors to indicate the values of daily weather for all bus journey records.

4 Empirical Evaluation

In this section, we report a systematic empirical study on a real-world bus data set from Tampere city.Footnote 3 The data set includes the bus stop information of Tampere, 43 bus routes, 75 bus stops of a line on average, and 6,297,520 records generated from all weekdays from August 1, 2015, to October 30, 2015. We also collected daily weather activity and average daily temperature for each record from Weather Underground Web site.

All algorithms are implemented in Java and compiled using JDK 7. The distributed environment is built using Spark 1.6.2 and includes eight nodes. Each node is a computer with an Intel Core i7-6700 3.40 GHz CPU, and 64 GB main memory, running Ubuntu 14.04 operating system.

Figure 8 illustrates the quantitative relationships between the numbers of non-on-time arrivals with different time intervals (measured in minutes) ahead or delayed. From Fig. 8, it is interesting to see that most (over \(75\%\)) of non-on-time arrivals happened within \(\pm \,3\) min compared to the scheduled time. Specifically, for the early case, there are over \(87\%\) arrivals arriving slightly ahead of schedule. And for the delaying case, there are \(77\%\) arrivals arriving slightly behind the schedule. Considering the complex situations in the real world, it is impractical to expect that every bus arrives on-time exactly. Naturally, we treat bus arrivals with slight time ahead or delayed as normal (on-time) cases. Thus, in our empirical study, we label each bus arrival as a non-on-time event including “early case” or “delaying case”, if the value of “Delay” is \(<-3\) min (“early case”) or \(>3\) min (“delaying case”).

Fig. 8
figure 8

Distributions of the delay (minutes) of non-on-time arrivals, a early case, b delaying case

Figure 9a illustrates the number of days with respect to certain weather. The daily weather contained in weather data include: rain, fog, and thunder. We can see that the rainy weather is more frequent than other weather activities. As the number of days with thunder (i.e., 3) is too small, we only consider rain and fog in this empirical study. Figure 9b shows the change of daily temperature. The highest temperature is \(19\,^{\circ }\hbox {C}\) on August 15, and the lowest temperature is \(-\,2\,^{\circ }\hbox {C}\) on October 9, 28, and 30. There are 51 such days that the average daily temperature is above \(10\,^{\circ }\hbox {C}\).

Fig. 9
figure 9

Statistics on daily weather activities and average daily temperature change a weather activities and b temperature change

Fig. 10
figure 10

Number of non-on-time events a Aug, b Sept, and c Oct

4.1 Effectiveness

We verify the effectiveness of Bus-OLAP using three kinds of typical queries: temporal queries, spatial queries, and spatial–temporal queries. For the temporal queries, there are two frequent queries: (1) What is the number of delaying events in every weekday? (2) Which are the routes that have the most delaying events?

Figure 10 illustrates the numbers of non-on-time arrival events on weekdays in August, September, and October, respectively. As we stated in Sect. 1, the main concern of bus non-on-time query is the delaying case, and in Fig. 10, we can see that the number of delaying cases is much larger than the number of early cases. It is interesting to see that the weekday with the maximum number of non-on-time events keeps changing from August to October. For example, in August, the number of delaying cases on Monday is the largest, in September, the largest number is on Tuesday, while in October, the largest one is on Thursday.

Table 3 lists the top 3 routes where delaying events occur frequently during different time intervals. It is interesting to see that Lines 13, 8, and 3 are the routes with the largest numbers of delaying events, especially in time interval [14:00, 18:00), over the three months. We also find that the lines that have frequently delaying events during different periods are different. One possible explanation is that the direction of crowd flow in the morning rush hour is different from that in the evening rush hour. The buses were delayed due to the getting in and getting off time by large numbers of passengers.

Table 3 Bus lines with the largest number of delaying events during different time intervals
Fig. 11
figure 11

Illustration of KNN queries (\(k = 8\)), a stop “Yliopistonkatu”, b stop “Villa Viola”

Fig. 12
figure 12

Illustration of RKNN queries (\(k = 8\)), a stop “Yliopistonkatu”, b Stop “Villa Viola”

Figures 11 and 12 illustrate the query results of spatial queries, i.e., KNN and RKNN, on two bus stops, respectively. The two target bus stops are “Yliopistonkatu”(ID: 560) and “Villa Viola” (ID: 700). In Fig. 11, the red points are the query bus stops, while the blue points are the eight bus stops nearest to the query stops. Similarly, in Fig. 12, the red points are the query bus stops, while the blue points are the RKNN bus stops for the query stops.

One typical bus non-on-time query is detecting bus delay aggregation. Please recall that, as listed in Table 2, we partitioned the space of Tampere city into \(2000 \times 1000\) equal-sized grids in Bus-OLAP. Let the maximum search range be \(100 \times 100\) grids. Then, by Bus-OLAP, we can answer following questions with different given time intervals: (1) “On which weekday the most significant delay aggregations took place?”, (2) “In which time interval the most significant delay aggregation occur?”, (3) “Which day and time interval has the most significant bus delay aggregations.”

Table 4 lists the zones of spatial–temporal delay aggregation with different time intervals. We present a zone in the form of (minLon, maxLon, minLat, maxLat), where minLon and maxLon denote, respectively, the minimal and maximal longitude of a zone, minLat and maxLat denote, respectively, the minimal and maximal latitude of a zone. As listed in Table 4, there are more significant delay aggregations during [14:00, 18:00) compared to other time intervals. Furthermore, period [14:00, 18:00) on Friday is the worst time interval as it has the largest number of significant delay aggregations. We guess it is because the traffic flow is heavy in the afternoon rush hour of Friday.

Figure 13 shows the zones with significant bus delay aggregation in the morning (a) and in the afternoon (b), respectively. We find that the zones where bus delay aggregations happened are non-overlapping. We think one possible reason is that the zones where buses are full of people are different in the morning and afternoon. To verify the correctness of the detected zones with bus delay aggregation, we consider the heat delaying cases in Fig. 14. It is easy to see that the detected zones are consistent with the heat distributions.

Intuitively, the bus delay aggregation is closely related to the traffic flow, detecting the zones with significant bus delay aggregation can reflect the change of traffic flow. In addition, it is interesting to investigate the impacts of weather conditions on the bus delay aggregation. To this end, we perform queries taking weather factor into consideration over morning and evening rush hours, respectively.

In Fig. 15, we use rectangles with red solid line, blue dot line, and dark dash line to indicate the zones where significant bus delay aggregation happened in normal weather, rainy weather, and foggy weather, respectively. We can observe the flow of traffic changes from the east side of Tampere city to the west side in both normal weather and foggy weather from 16:00 to 18:00. We also see that the zone with the most significant bus delay aggregation in foggy weather is clearly different from the ones in normal and rainy weather during 8:00 \(\sim\) 9:00 am. In addition, we can see that the zone with the most significant bus delay aggregation in rainy weather is clearly different from the ones in normal and foggy weather during 17:00 \(\sim\) 18:00. We guess the reason lies in that, firstly, the impact of fog on bus delay aggregation mainly exists in the morning, since fog disperses in the afternoon, and secondly, the impact of rain mainly exists in the evening due to the accumulation of rainwater on the road surface.

Table 4 Zone with spatial–temporal delay aggregation of every weekday
Fig. 13
figure 13

Illustration of zones with significant bus delay aggregation, a morning, b afternoon

Fig. 14
figure 14

Bus delay heat map, a morning, b afternoon

Fig. 15
figure 15

Zones with the most significant busy delay aggregation w.r.t. weather activity in morning and evening rush hours. (The detected zones in normal weather, rainy weather, and foggy weather are, respectively, indicated by rectangles with red solid line, blue dot line, and dark dash line.), a [07:00, 08:00), b [08:00, 09:00), c [16:00, 17:00), d [17:00, 18:00)

In Fig. 16, we illustrate the zones with the most significant busy delay aggregations when the average daily temperature is above \(10\,^{\circ }\)C (rectangle with red solid line) and is not above \(10\,^{\circ }\)C (rectangle with blue dash line), respectively. Clearly, the zones detected under different temperature ranges are different. As shown in Figs. 15 and 16, taking weather factors into bus non-on-time analysis is necessary and reasonable.

Fig. 16
figure 16

Zones with the most significant busy delay aggregation w.r.t. average daily temperature in morning and evening rush hours. (Rectangles with red solid line are the zones when the average daily temperature \(> 10\,^{\circ }\hbox {C}\), and rectangles with blue dash line are the zones when the average daily temperature \(\le 10\,^{\circ }\hbox {C}\).), a [07:00, 08:00), b [08:00, 09:00), c [16:00, 17:00), d [17:00, 18:00)

4.2 Efficiency

In Bus-OLAP, we build the indexes based on bit-vectors and perform bus non-on-time queries by index operations. In order to verify the effectiveness of the index building, we first test the runtime for building the indexes with respect to the number of records in Bus-OLAP. Then, we compare the query efficiency using Bus-OLAP against the query using relational database MySQL 5.5. The results of efficiency test are shown in Fig. 17. We test the efficiency by three most frequently used queries, querying the time interval with the most delaying events for temporal query shown in Fig. 17b, searching the RKNN bus stops for spatial query illustrated in Fig. 17c and finding the zones with significant bus delay aggregation from Monday to Friday for temporal–spatial query shown in Fig. 17d. Clearly, as shown in Fig. 17a, we can see that the index building in Bus-OLAP is efficient, and the building time increases linearly with the size of bus journey records. Moreover, performing bus non-on-time query by Bus-OLAP is significantly more efficient than by MySQL.

Fig. 17
figure 17

Efficiency on index building and query, a index building, b query on time interval with most delay, c RKNN query (\(q = 560\)), d query on zones with delay aggregation

Fig. 18
figure 18

Query efficiency w.r.t. number of computing nodes, a computing the LLR of each grid, b query on delay aggregation during [16:00, 17:00)

To satisfy the requirement of large number of queries, we have implement Bus-OLAP on Spark, so that parallel query computation can be supported. Again, we test the query efficiency with respect to the number of computing nodes by finding the zones with significant bus delay aggregation during a certain time interval in weekday. Figure 18a shows the runtime for computing the LLR of every grid with respect to the number of computing nodes. Figure 18b shows the runtime for querying delay aggregation zones during time interval [16:00, 17:00) on weekdays. It is clear that in both LLR computation and aggregation queries, the runtime of Bus-OLAP decreases when more computing nodes are used.

In summary, by our proposed index building method and the parallel query framework, Bus-OLAP is efficient for bus non-on-time query.

5 Conclusions

In this paper, we tackled the novel and interesting problem of non-on-time query over bus journey data. We designed a model, named Bus-OLAP, to support non-on-time queries over the data. For the sake of efficiency, we built the index of bus journey data using bit-vectors and introduced index operations to convert the queries into bitwise operations. In addition, we implemented the distributed query computation based on the Spark framework. Our experiments verified the effectiveness and efficiency of Bus-OLAP.

There are several interesting issues that deserve research effort in the future. First, we will consider more complex scenario applications of non-on-time analysis and fuse some implicit factors (e.g., road conditions, number of passengers) that affect the bus. Second, there are large bus data sets in reality, and we try to apply the Bus-OLAP to more real data and more data sources of other cities. Then, we will continue on optimizing the frequently used non-on-time queries. It is also interesting to consider dynamic computation of the current traffic situation, that is, analyzing if the current situation is different from “normal.” Moreover, we plan to combine the bus data analysis with the management of urban traffic and study the relationships between them.