1 Introduction

The ubiquity of GPS tracking devices and Internet of Things (IoT) technologies has resulted in a collection of massive amounts of data that describe the temporal evolution of moving objects, like cars, trucks, and pedestrians, for government agencies and private companies as well. Many applications exist for finding the best route between multiple points of interest (PoIs), estimating the arrival time to some destination, and even to predict the traffic on a certain route at a certain point in the future (e.g., Google MapsFootnote 1). However, most of these solutions are proprietary and not available to the general public. For example, Google Maps APIs allow computing travel times and distances between locations, and determining the roads that a certain vehicle is traveling. These functionalities are limited by the lack of a query language, and, being proprietary, cannot be extended by the GIS community. This is addressed by MobilityDB (Zimányi et al. 2020), an open-source PostgreSQL and PostGIS extension that provides a set of functions and spatio-temporal datatypes that, together with PostGIS capabilities, form a flexible tool set for querying moving objects. Besides, the General Transit Feed Specification (GTFS) is a common format used to share public transportation schedules and real-time data, along with associated geographic information. GTFS defines two standards, namely GTFS Static, which is used to define schedules, and GTFS Realtime, which is used to communicate live positions of vehicles.

The present work shows how GTFS data can be used to analyze the public transportation system of Buenos Aires using MobilityDB. It describes the software developed to acquire, import, and load the data, and then to query and visualize the results. The solutions described in this work cannot be generalized for importing any GTFS dataset into MobilityDB, because of the different options that the GTFS standard allows. However, many datasets of GTFS transit information may share the same difficulties for importing the data, therefore this work can serve as a guide for many other similar cases. All software developed for this project is publicly available.Footnote 2 More precisely, the contributions of this work are as follows:

  • A description of the data acquisition processes for GTFS Static and GTFS Realtime data for the City of Buenos Aires, in Argentina.

  • A description of the process of importing the data of both GTFS standards into MobilityDB.

  • A study on how MobilityDB can be used to analyze data about the public transport system. This study is carried out using public transport data for Buenos Aires, and includes, among others, a comparison of planned and actual schedules, average delay per bus line, and average speed per line. The goal is to show how MobilityDB makes these queries not only computationally efficient, but also easy to express.

The remainder of this paper is organized as follows: Sect. 2 discusses related work and provides the necessary background to make the paper self-contained. Section 3 describes MobilityDB and compares it against PostGIS as a solution for the study of mobility data. Section 4 presents the case study, and also describes the data acquisition, preprocessing, and post-processing tasks. Section 5 presents the analysis tasks and discusses the results. The paper concludes in Sect. 6.

2 Background and related work

This section presents basic concepts concerning moving object databases, and also their use in traffic analysis. Since the work in the present paper is based on the use of MobilityDB, this database is presented in detail in Sect. 3.

2.1 Moving object databases

Moving objects (MOs) are objects (e.g., cars, trucks, pedestrians) whose spatial features change continuously in time (Güting et al. 2005). Moving objects are represented as sequences of spatio-temporal points, of the form (xyt). As defined in Spaccapietra et al. (2013), MO data are typically divided into parts, called trajectories, defined within an interval \([t_s,t_f]\), where \(t_s\) and \(t_f\) represent the start and the end time instants of the trajectory, respectively. Over these trajectories, different kinds of analyses can be performed (e.g., pattern matching, semantic analysis). Although raw data come in the form of discrete points, for many applications that need to simulate and study the real movement of an object, a continuous representation of a trajectory is needed, and this requires appropriate interpolation functions. When these functions are provided, trajectories are called continuous, otherwise they are denoted discrete. Moving object databases (MODs) are databases that store the positions of MOs at any point in time, in other words, to represent a continuous function from an instant to a point with signature \(f:instant \rightarrow (x,y)\). To represent MOs, the definition of appropriate data types is needed. As explained in Vaisman and Zimányi (2019); Zimányi et al. (2020), temporal types capture the evolution over time of base types and spatial types. For instance, temporal floats may be used to represent how the salary of a person evolves across time. Analogously, a temporal point may represent the evolution in time of the position of a vehicle or a pedestrian, reported by a GPS device, yielding a temporal geometry of type point. In the sequel, following Zimányi et al. (2020), a route denotes a certain spatial trajectory that a moving object can take, without a specific date or time associated to it, and a trip denotes a route repeatedly traversed at a certain time.

The first proposed MOD was SECONDO (Xu and Güting 2013), a MOD developed at the Fern Universität in Hagen, based on the model proposed by Güting et al. (2005). SECONDO provides an extensible architecture that can support spatial and spatio-temporal applications. This architecture has three main components, namely the kernel, the optimizer, and a graphical user interface (GUI). The kernel, being extensible, can implement a wide array of data models, through different algebra modules that provide a collection of type constructors and operators. Hermes is a MOD introduced by Pelekis et al. (2006); Pelekis and Theodoridis (2014), developed at the University of Piraeus, in Greece. In addition of being a MOD, Hermes can also be used as a pure temporal or a pure spatial system. Its functionality is achieved through a collection of abstract data types (ADT) and their corresponding operations, developed and provided as a data cartridge that extends SQL with MO semantics. One of the main features of SECONDO and Hermes is that they provide support to different kinds of MOs that go beyond the basic types and the geometric point type (for example, moving polygons are supported). However, SECONDO and Hermes present several drawbacks to be used as a real-world analytical tool. First, both prototypes are not easily integrated with relational databases. For example, the Hermes cartridges mentioned above, that encapsulate the MO functionality, extend the Oracle DBMS. Therefore, to build an application on top of the database, the application developer must embed PL/SQL scripts into a source program (e.g., a Java program). These scripts are the ones that actually call the Hermes type constructors. Integrating SECONDO with existing DMBSs is even more complicated, given that the former is a packed system.

The problems expressed above are overridden by MobilityDB, a database management system for moving object geospatial trajectories, such as GPS traces. Built on top of PostgreSQL and its spatial extension PostGIS, MobilityDB adds support for temporal and spatio-temporal objects to the PostgreSQL database. MobilityDB can be run in distributed environments, as described in Bakli et al. (2019), therefore it is appropriate to support high data volumes. MobilityDB, like PostgreSQL, is coded in the C programming language; therefore, it seamlessly extends the PostGIS library with temporal data types, which is appropriate for the purpose of running analytic indicators over relational databases. A current limitation of MobilityDB is that it only supports moving points. However, for traffic analysis, moving points are appropriate enough, as it will be shown in the next sections.

2.2 Using moving object data for traffic analysis

There is a wide array of works showing how moving object data can be used to analyze traffic in road networks. A comprehensive study of mobility data analysis problems is presented in Renso et al. (2013). Krogh et al. (2012) use GPS trajectories to estimate speeds on road segments, and propose several indicators. Meng et al. (2017) use loop detectors and taxi GPS trajectories for traffic analysis. Further, Hohmann et al. also study traffic from a user’s point of view (Hohmann and Geistefeldt 2016). Data warehouses (Vaisman and Zimányi 2014) have been also proposed for moving object analysis in general, and for traffic analysis in particular. Andersen et al. (2014) propose the use of a data warehouse for analysing speeds, fuel consumption, among others. Formal frameworks for trajectory analysis using data warehouses are discussed in Leonardi et al. (2014) and da Silva et al. (2015). Recently, Vaisman and Zimányi (2019) proposed the use of MobilityDB for building trajectory data warehouses.

Although sometimes overlooked, the problem of preprocessing GPS data is crucial to guarantee that analysis results are correct. For instance, Krogh et al. (2012) only select trajectories that follow certain paths, and Meng et al. (2017) use map matching just for inferring average speeds. To give an idea of the impact of this step, this paper shows that a large portion of the data sets used here is cleaned out for several reasons. Parent et al. (2013) propose a three-step trajectory preprocessing methodology, consisting in a cleaning phase, followed by a map-matching task, and finally a compression step. Data cleaning is studied in Yan et al. (2013), although much of the task is generally performed manually, like discussed by Fu et al. (2016), usually limiting to clean GPS signal errors. This paper discusses also other kinds of errors.

As mentioned above, map matching is usually part of the data preprocessing tasks. Map matching consists in transforming absolute GPS coordinates into a sequence of road segments, matching raw GPS observations to the road network, accounting for constraints like speed limits and traffic directions. Map matching can be performed online or offline (Wei et al. 2013).

2.3 The GTFS specification

The General Transit Feed Specification (GTFS) is a data format used to define public transportation schedules and real-time data with associated geographic information. The GTFS has two versions, Static and Realtime, the former being the most widely used. GTFS StaticFootnote 3 is used to predefine trip schedules, while GTFS RealtimeFootnote 4 is a feed of real-time data with the positions and timestamps of the data points within trips and routes.

GTFS Static is composed of a series of text files with a CSV format that are stored in a ZIP file. Each file determines a specific aspect about the public transportation schedules, such as stops, routes and trips. The GTFS Static reference contains the following files:

  • agency.txt. Lists the transit agencies that operate the transport routes.

  • stops.txt. Lists the stops that compose the scheduled trips.

  • routes.txt. Lists the routes in the public transport schedule.

  • trips.txt. Lists the trips contained in the schedule.

  • stop_times.txt. Links trips with stops, and adds the arrival time and departure time fields for each stop.

  • calendar.txt and calendar_dates.txt. The file calendar.txt contains a service identification number, and a field for each day of the week, representing if the service is available that day. It also contains a start date and an end date for the service; calendar_dates.txt adds service exceptions, which can be additions or removals.

  • In addition to the above, there are several optional files. Some of these are fare_attributes.txt (which includes trip prices and payment methods), fare_rules.txt, shapes.txt (which includes data points that determine the trajectory taken between stops), frequencies.txt, transfers.txt, pathways.txt, levels.txt, feed_info.txt, etc.

GTFS Realtime is defined in a looser manner, compared against the static option. A GTFS Realtime feed is served via the HTTP protocol, and should provide frequent updates, although there are no constraints on how frequently these updates should be served nor on the exact manner in which the feed is updated or retrieved. Any web server can host and serve the data, and all transport protocols can be used as well. A GTFS Realtime feed can support the following types of information:

  • Trip updates: delays, cancellations, changed routes.

  • Service alerts: events affecting a station, route or trip.

  • Vehicle positions: information about the vehicles currently in service, with their locations and other data such as the congestion level.

GTFS Realtime has two feed elements, messages and enums. The former represent complex types and the latter represent a list of fixed values generally used to communicate certain events. The feed elements are used when the web server communication method is the Protocol Buffer, used by the real-time data feed API for the city of Buenos Aires, which also uses a JSON format body within an HTTP response. The API sends nested messages as fields in the JSON body. The ones used for Buenos Aires are:

  • FeedEntity. This message is sent on all HTTP requests, and provides an update of an entity in the transit feed. It contains an identification field, a TripUpdate message, a VehiclePosition message and an Alert message. However, the Alert message is optional, and it is not implemented in this case.

  • TripUpdate. This message provides an update on the progress of a vehicle along a trip. It contains a trip descriptor, a vehicle descriptor and fields to represent the delay and new stop time that is being alerted.

  • VehiclePosition. This message provides real-time position information of a given vehicle. It contains a trip descriptor, a vehicle descriptor, a position described in latitude and longitude coordinates, a stop identifier of the current stop, and a timestamp in POSIX time.

There is limited scientific literature around GTFS. Vuurstaek et al. (2020) describe a bus stop mapping technique that combines the OpenStreetMap and GTFS open data sources. Kaeoruean et al. (2020) present two approaches for measuring the difference between the demand and supply for public transit. de Queiroz et al. (2019) analyze the conformity of GTFS routes and the actual bus trajectories in four cities in Brazil. Wessel and Widener (2017) study the problem of schedule padding, which is the extra time added to transit schedules to reduce the risk of delay. Regarding visualization, Kunama et al. (2017) present a tool called GTFS-Viz for preprocessing and visualizing GTFS data, and Bast et al. (2014) introduce a tool that shows a worldwide live map of real-time public transit data based on freely available GTFS timetable data and real-time delay information. Finally, Braga et al. (2014) describe a web-based application aiming to simplify the creation and editing of public transportation data. None of the works above includes the notion of moving objects in the analysis of transportation networks. To the authors’ knowledge, this is the first attempt to analyze GTFS data with a moving object database.

3 MobilityDB

This section first presents a brief overview of MobilityDB to make the paper self-contained. Further details about MobilityDB can be found in Zimányi et al. (2020) and in the system’s documentation.Footnote 5 The second part of the section compares the MOD solution based on MobilityDB, against the classic solution based simply in PostGIS.

3.1 MobilityDB Data types and functions

MobilityDB defines temporal types for handling objects whose value changes over time, for example stock prices or temperature, among others. Temporal values are initially built from a discrete set of values and associated timestamps (i.e., observations) and represent the evolution in time of the value. Since MO databases represent a continuous function, values between discrete time instants are interpolated using either a stepwise or a linear function.

Temporal types are based on four time types: the timestamptz type provided by PostgreSQL, and three new types, namely period, timestampset, and periodset. The period type represents a set of timestamps between a lower and an upper bound. The timestampset type is a collection of one or more timestamptz values. The periodset is a non-empty collection of ordered and non-overlapping period values.

MobilityDB provides four temporal alphanumeric types, namely tbool (temporal Boolean), tint (temporal integer), tfloat (temporal float), and ttext (temporal text). Such temporal types are typically used to represent dynamic properties of a moving object. A temporal Boolean can represent, for example, whether a car is driving below the speed limit of the road segment on which it is located. A temporal integer can be used to represent the gear of the car while a temporal float can be used to represent its speed. Finally, a temporal text can be used to represent for example the transportation mode of a moving person, such as walk, car, bicycle, etc.

MobilityDB also provides two spatio-temporal types, namely tgeompoint (temporal geometry point) and tgeogpoint (temporal geography point), which correspond to PostGIS types geometry and geography. The difference between the two is the reference system: geometry points use a Cartesian reference system and allow calculation of speed and other distance-related metrics, while geography points use a geodesic reference system, which implies more complex operations.

MobilityDB includes a vast number functions to access and manage temporal types. Examples of these functions are startTimestamp, endTimestamp, timespan, speed, direction, cumulativeLength, nearestApproachDistance and many more. Some examples illustrating how MobilityDB’s temporal types work are given next. For clarity, the results are given after each query in all the examples.

Example 1

The query below constructs two tint values and applies temporal addition to them. The resulting value is a tint as well.

figure a

The result is obtained as follows. Since the value ‘1’ exists between 2001-01-01 and 2001-01-02, and ‘1’ exists between 2001-01-02 and 2001-01-04 (note the open and closed intervals), ‘1’ and ‘2’ are added in the intersection of the intervals, that is, 2001-01-02 and 2001-01-03.

Example 2

Now we illustrate the temporal intersection (tintersects) between a temporal point and a geometry. The resulting value is a tbool.

figure b

Figure 1 depicts the result given as text above. We can see that the initial and final positions of the temporal point do not intersect the polygon. However, performing a linear interpolation, the moving point intersects the polygon at two locations at instants 2001-01-02 and 2001-01-03. The latter are indicated above as t@2001-01-02 and t@2001-01-03. The former as f@2001-01-01 and f@2001-01-04.

Fig. 1
figure 1

Intersection between a tgeompoint and a polygon

Example 3

We now show how we can create a table Trips containing a temporal column, add data to it, and query the table using MobilityDB.

figure c

Now, we write a query that uses the table Trips defined above, and retrieves the value of the temporal points at a specific timestamp, returning two points (note that points (2,0) and (1,1) exist in the table at the timestamp mentioned in the WHERE clause.

figure d

3.2 MobilityDB vs. PostGIS

A question that immediately arises is the following: Why do we need a MOD when existing tools, e.g., PostGIS, can also handle this problem? There are two key advantages for the MOD approach. On the one hand, queries over moving object data are more concise, easier to understand, and efficient. On the other hand, moving object data can dramatically reduce the storage space. Below we elaborate on these two issues.

3.2.1 Expressing trajectory queries

Consider a table gpsPoint(tripID, pointID, t, geom), storing trajectories, represented using PostGIS.Footnote 6 In this table, tripID and pointID are, respectively, identifiers of the trip and the observation, t is a timestamp, and geom is the geometry of each point. There is also another table pointOfInterest(poIID, poIName, geom), containing points of interest (PoI). Using these tables, a query asking for the points in the trajectory that are within 30 m from a point of interest (PoI) reads:

figure e

Note however, that the query does not account for the time interval of the situation, that is, for how long the trajectory was within 30 m from the PoI. The PostGIS query that solves this problem is quite more involved than the one above, as it can be seen below.

figure f

The query above requires a deep understanding of SQL. It uses Common Table Expressions (CTE) for defining temporary tables to incrementally compute the final result of the query. Table pointPair stores every pair of consecutive points that belong to the same bus trip into one tuple. For this computation, it uses window functions, another advanced SQL feature. Table segment connects these pairs of consecutive points with a line segment, thus performing a linear interpolation between them. The locations where the bus starts/ends within 30 m from the PoI are computed in the approach table. The final query lists such points and computes the time elapsed between them assuming constant speed. The complexity of the query above arises from the problem it addresses: it attempts to represent a continuous movement by means of reconstructing discrete GPS data. On the other hand, using a system that naturally handles continuous data may be a better option.

Consider now the solution using MobilityDB and the functions explained in Sect. 3.1. We first create a table busTrip(tripID, trip), that stores the continuous trajectory. Attribute tripID is the trajectory identifier, while the attribute trip is of type tgeompoint, which is the MobilityDB type for storing a complete trajectory. The query above reads in MobilityDB as follows:

figure g

The nesting of the functions getTime, atValue, and tdwithin returns the time periods during which a bus trip has been within a distance of 30 m from a PoI. The function atPeriodSet restricts the bus trip to only these time periods. The function asText converts the coordinates in the output to textual format.

As a conclusion, it clearly follows that a database that stores continuous trajectories will allow more natural and simple queries than a spatial database based on discrete data types. Furthermore, these queries will be more efficient, in particular since a full trip will be brought to memory with a single database access, while as illustrated above, multiple database accesses are required for the equivalent PostGIS query. Furthermore, more efficient algorithms can be defined for manipulating the continuous trajectories. For example, since the points and associated timestamps are stored in ascending order of time, MobilityDB uses binary search to efficiently locate the position of a moving object at a given timestamp.

3.2.2 Trajectory representation

As seen above, PostGIS represents the trajectory as a sequence of GPS points, each one stored on a single row. On the contrary, MobilityDB makes a more efficient use of the storage space taking advantage of the continuous trajectory notion. That is, for example, if an object moves in straight line at constant speed during a portion of the trajectory, only the starting and ending points of this segment need to be stored. All the intermediate points can be discarded. This can be seen in Fig. 2,Footnote 7 which compares both options. Note that the MobilityDB representation only used the points highlighted in green. As another example, when a moving object does not move, e.g., when it is stopped due to a traffic light or a traffic jam, MobilityDB will remove redundant observations and will only keep two of them when the stopped and when the object started to move again.

Fig. 2
figure 2

Trajectory compression

Experiments performed over the Moscow public transportation system showed a dramatic reduction in the storage space required to store transportation data: 10 billion rows a day (around 500 MB per day), are represented in MobilityDB by 15,000 rows (around 5 MB per day).

3.3 Other MobilityDB applications

Consider now two buses moving in a city, as shown in Fig. 3, call them RouteT1 and RouteT2, respectively. We can see, for instance, that it takes 15 minutes to the first bus to go from point (0,0) to point (3 3). Then it stopped for 10 minutes at that point. We assume a constant speed between consecutive pairs of points. Thus, RouteT1 travelled a distance of \(\sqrt{18}=4.24\) in 15 minutes, while RouteT2 travelled a distance of \(\sqrt{5}=2.23\) in the first 10 minutes and a distance of 1 in the following 5 minutes.

Fig. 3
figure 3

Graphical representation of the trajectories of two buses

In MobilityDB, the operation trajectory projects moving geometries into the spatial plane. The projection of a temporal point into the plane may consist of points and lines, the projection of a temporal line into the plane may consist of lines and regions, and the projection of a temporal region into the plane consists in a region. In our example, trajectory(RouteT1) would result in the leftmost line in Fig. 3, without any temporal information.

This raises an interesting question and illustrates another key feature of using MOD for mobility analysis. If we want to study how close to each other are two bus lines at any time, the spatial information would not be enough. We must compute the distance between the two lines at any time instant. Thus, the ST_Distance function in PostGIS would not be enough. We would need the MobilityDB function, distance(RouteT1, RouteT2), which returns a temporal real value, shown in Fig. 4. It can be seen, for instance, that this function has value 1.5 at 8:10 and 1.41 at 8:15.Footnote 8

Fig. 4
figure 4

Distance between the trajectories of the two vehicles in Fig. 3

We now briefly explain why this feature is not easily performed in PostGIS (or any spatial database). Consider the two moving points p and q on the left-hand side of Fig. 5. In order to compute the distance between these objects, we first need to temporally synchronize them, as the figure shows. This synchronization is performed internally by MobilityDB, restricting the two trajectories to their common time span (from \(t_1\) to \(t_5'\)), and adding the intermediate synchronization points represented by the dashed vertical lines. In the figure, the solid circles represent the observations while the hollow circles represent the interpolated points added for synchronization. Two interpolated points are highlighted in the figure, shown within a box. Then, the computation of the temporal distance is performed for each synchronized segment. For this, we compute the distance at the beginning and the end of the segment but in addition we need to determine whether there is a turning point, which is the timestamp at which the distance between the two trajectories is minimal. These are represented in the figure by the dotted lines. In this figure, we have two turning points where the two objects are at the same point at the same time, and thus the distance between them is zero.

Fig. 5
figure 5

Computing the temporal distance between two moving objects. Left: synchronization of the two trajectories. Right: computing the turning point for each segment of the synchronized trajectories

On the right-hand side of Fig. 5 we illustrate the general case of two synchronized segments, where the two objects are at the same point p2 at two different timestamps. In the figure, p moves from p2@t1 to p1@t2, while q moves from p3@t1 to p2@t2. The turning point is indicated with the dotted line and in this case, opposite to the case above, the distance between the trajectories at the turning point is not zero. Therefore, the result of the temporal distance for this segment would be composed of three values: ST_Distance(p2,p3)@t1, ST_Distance(p,q)@t, and ST_Distance(p2,p1)@t2. This computation must be performed for every pair of synchronized segments. Therefore, the reader could understand intuitively that computing the temporal distance in PostGIS (i.e., without the MobilityDB temporal point data type) would be very complex and inefficient.

4 Case study

This section describes the use of MobilityDB for analyzing GTFS data, both Static and Realtime. For this, data of the public transportation system in Buenos Aires, Argentina are used. The area under study includes the city of Buenos Aires and its outskirts, known as the Metropolitan Area of Buenos Aires (AMBA).

The AMBA public transportation system consists of three main branches. The subway system, contained in the city itself, the metropolitan railway system, which connects Buenos Aires with its suburbs, and the bus system, composed of hundreds of municipal and provincial bus lines from all across the urban area. The open data site of the city of Buenos AiresFootnote 9 provides both, the itineraries for all branches of the AMBA transport system as well as real-time information on the moving vehicles. For the present work, the information referred above is loaded and analyzed using MobilityDB to obtain, for example, the average speed for each hour, for each day of the week, vehicles passing close to a PoI, and average delay for each bus. Visualizations like transport delay heat maps are also produced.

The data acquisition and preprocessing tasks are described next. The use of these data for analysis is described in Sect. 5.

4.1 Data model

The input data must be converted into MobilityDB’s native spatio-temporal types. Thus, the output of the import process is a relational table called trips_mdb with the following columns:

  • trip_id: The identifier of a particular trip.

  • vehicle_id: The identifier of a particular vehicle in service.

  • startdate: The starting date of the trip.

  • starttime: The starting time of the trip.

  • trip: The location and time information for the whole trip represented using MobilityDB tgeompoint data type. This is the column used in most of the queries below.

  • traj: The spatial trajectory of the trip represented using PostGIS geometry data type. In other words, this is the spatial projection of the trip. This column is used for representing the trajectories of the trips graphically, e.g., using QGIS.

As discussed in Sect. 3.2, the MobilityDB type tgeompoint keeps the discrete set of points and associated timestamps (i.e., the observations) for a complete trip in a single value. All these observations would be stored in multiple rows in PostGIS. By collecting all the points and timestamps of a trip in a single value, MobilityDB is able to simulate continuous spatio-temporal data assuming a linear interpolation between subsequent observations. Therefore, in this work, GTFS data are turned into spatio-temporal continuous trajectories, such that, for GTFS Static they represent the planned itineraries while for GTFS Realtime they represent the actual mobility data in real time. Therefore, this case study shows the advantages of using moving object (i.e., continuous) data, instead of the classic solution that uses discrete data.

4.2 Data acquisition and preprocessing

This section reports the work required to acquire, import, process, and load data into MobilityDB structures. The study is divided into two sections: GTFS static and GTFS Realtime data.

4.2.1 GTFS static

Although GTFS Static data sets for the city of Buenos Aires can be obtained for trains, metro, and buses, the latest data set that can be found on the government’s official website is from August 2019. Thus, we decided to use a more recent dataset from OpenMobilityDataFootnote 10 that spans from April 20 to October 20, 2020, for the city’s bus system. The downloaded data for the bus system in Buenos Aires, includes the following files: agency.txt, calendar_dates.txt, routes.txt, shapes.txt, stop_times.txt, stops.txt, and trips_txt. These files are described in Sect. 2.3 above. We remark that, although train and metro static data are available, since this is not the case for real-time data (see below), only bus data are considered for the present study.

The bus system data present an anomaly which raises issues during interpolation: there are cases where two distinct stops are very close to each other, and the feed specified the arrival time to be the same for both stops. This is illustrated in Fig. 6, where we can see two rows with the same arrival_time and different stop_id.

Fig. 6
figure 6

Bus system anomaly

The GTFS Static Pipeline is split in two steps, namely, preprocessing and data importing. The preprocessing pipeline includes two scripts:

  • Data pruner: a Python script that removes unused columns from GTFS data.

  • Data wrangler: a Go script that finds anomalies in arrival times and modifies values to allow interpolation.

Once the data are preprocessed, the data importing phase takes place. This is depicted in Fig. 7, and includes three scripts:

  • GTFS Importer: an SQL script that loads preprocessed data into auxiliary tables.

  • Dates Importer: an SQL script that loads service dates into auxiliary tables depending on the availability of the files calendar_dates.txt and/or calendar.txt.

  • MDB Importer: an SQL script that populates the trips_mdb table. It takes care of generating the geometry of every trip’s route, calculating arrival times of all trip stops, and finally generating the tgeompoint from the GTFS information.

All agencies reported in the transit feed use UTC-3 Timezone, thus, timestamps are loaded with America/Argentina/BuenosAires timezone.

Fig. 7
figure 7

GTFS-Static importing pipeline

4.2.2 GTFS Realtime

The Transportation Ministry of the city of Buenos Aires provides a GTFS Realtime API that allows users to track the state of the public transportation vehicles.Footnote 11 This API provides endpoints that expose the status of both, trains and buses in the city and its outskirts. The endpoints are updated at an interval of thirty seconds, and support both Protocol Buffers and JSON responses. At the moment of writing this work, there is no endpoint for querying subway lines in real time, and during the week of polling, the train endpoint did not return any values. Therefore, the data extraction is limited to the bus system. However, for the goals of this work, the bus data suffices.

BA Catcher A scraper, called BA Catcher, that polls the transportation system at the update interval, has been developed. The scraper is run against the JSON endpoint starting August 18 and ending August 25, 2020. Figure 8 shows the architecture of the BA Catcher, composed of the following services:

  • BA Transport API: Transport API provided by the Buenos Aires Transport Ministry.

  • Polling Service: Service responsible for setting up the recurring JSON requests, parsing the response, and validating the values.

  • Position DAO: It is responsible for persisting values to the database, removing duplicate values.

  • Position Database: a PostgreSQL database that contains a single table Positions where all relevant information of the observations is persisted.

Fig. 8
figure 8

BA Catcher Architecture Diagram

For exploratory data analysis, the query below, expressed using PostgreSQL, produces a bar chart showing the number of timestamped locations reported for each trip. Figure 9 depicts the barchart, where it can be seen that there is a large portion of trips where less than 10 locations have been reported.

figure h

Table buckets defines a set of buckets, whose bounds were selected after executing some exploratory queries on the data. Table tripCount stores for each trip_id the number of observations in the data set. This table is joined with its corresponding bucket in table bucketTrip, according to the no_observ value. Then, the histogram table associates each bucket with the number of trips in the bucket. The final SELECT statement outputs this information to the terminal alongside a simple ASCII bar chart to improve readability.

Fig. 9
figure 9

Timestamped location frequency barchart

Data structure The fields stored from the HTTP requests for obtaining the real-time data are the following:

  • trip_id: The identifier of the trip that the moving object is performing. This identifier coincides with the identifier used in the static data.

  • vehicle_id: The identifier of the vehicle in service.

  • instant: A timestamp for the data being sent, in POSIX time.

  • latitude: The latitude of the vehicle at the instant, in the UTM coordinate system.

  • longitude: The longitude of the vehicle at the instant, in the UTM coordinate system.

  • startdate: The date in which the trip started, in YYYYMMDD format.

  • starttime: The time in which the trip started, in 24h format.

  • direction_id: The direction of travel for a trip, can be a 0 or 1, e.g., outbound or inbound.

There are several rows with positive latitude and longitude values, which are not coherent with the geographical location of the city of Buenos Aires. These values correspond to points somewhere in the Atlantic Ocean. When these points are plotted with negative values, they match the current trip.

Data acquistion and preprocessing Figure 10 illustrates the pipeline created to import the GTFS Realtime data used in this study. The BA Catcher component has been previously explained. The BA Exporter.sql program uses the data that BA Catcher stores, and creates a CSV file that is used for preprocessing. A Python program, Coordinate Corrector.py fixes the errors in the coordinate values mentioned above. The BA Importer.sql module creates a table called positions with the direct import of the fields mentioned above. With this table, the script creates points of the geometry type using the PostGIS function ST_MakePoint. The SRID of the geometry is set to 4326 (the WGS84 standard longitude and latitude coordinates on the Earth’s surface), since that is the format in which data come. Then, the table described at the beginning of the section is created, and the data from the table positions is imported. The following code creates the tgeompoint of a trip.

Fig. 10
figure 10

GTFS Realtime importing pipeline

figure i

The function tgeompoint_inst creates a tgeompoint with a single point at a certain instant, and by aggregating these with array_agg, they can be passed to tgeompoint_seq to create a tgeompoint that represents the whole trip, which is added to the trips_mdb table. The additional field traj is filled using the MobilityDB function trajectory, which returns the PostGIS geometry trajectory that a tgeompoint contains.

Both Static and Realtime values come in SRID 4326, which is a geodetic coordinate system. To calculate distance, speed, and other metrics, a plannar reference system is needed. Thus, the SRID 5345 is used, which is a Cartesian system that encompasses all of Argentina and its surroundings. To change the reference system, a MobilityDB function is executed over the loaded data. As another preprocessing action, bus lines with less than eleven real-time observations are removed to guarantee that they are useful for data analysis.

Finally, to compare the real-time and static bus feeds, all bus lines not present in the real-time feed are removed from the static feed, and, conversely, all bus lines not present in the static feed are removed from the real-time feed.

4.3 Map matching

To improve the results, offline map matching was applied to the data. For this, we used Barefoot,Footnote 12 an open-source Java library for online and offline map matching with OpenStreetMap. However, several problems were found, namely: (a) The offline map matching provided by Barefoot can only correct one route per thread. At the scale of thousands of trips like in the city of Buenos Aires, this constraint demands a parallel pipeline which would handle the workload; (b) Even though the BA Transport API documentation states that the response is updated every thirty seconds, this statement does not seem to be true for every line. In practice, the average interval between two adjacent samples for a given line was about 130 seconds, but the interval variance is very high, reaching a value up to 600 seconds. In this study, where both the uncertainty (given by the interval) and the scale (given by the size of the transport network) are very high, alternative solutions for the map-matching problem are explored, and described next.

Although real-time data suffer from inaccuracies due to both GPS signal errors and sampling frequency, these problems are not present in static data. The static input data are provided already map-matched. Therefore, the problem of matching the real-time trajectories to the physical route network equals matching the real time trajectories to the static ones. In order to test this hypothesis, an example of a trajectory that clearly displays the problem is shown next. The bus line #152 is an example of this issue, since during its trajectory it goes by the presidential house, along a large roundabout, and thus, with the current real-time frequency, MobilityDB is unable to create an adequate route, as shown on the left-hand side of Fig. 11.

Fig. 11
figure 11

Left: Bus line #152, static route (blue), real time data points (red), generated route (pink); Right: Trajectory corrected with the map-matching algorithm

The trajectory generated from the data points goes across the park, because the sampling frequency is not high enough to create a route that matches the static one. A manual map-matching algorithm is developed, and explained below. The code shown in Listing 1 produces such a match.

figure j

The subquery on Line 9 returns the line of the static trajectory that is contained between the two points closest to the red points (the extreme points, obtained with the API, that can be seen in the figure). This trajectory is referred as line. The query also returns dump, an array of points that are contained in the previously mentioned trajectory. The WHERE clause (Line 45) specifies the exact trip that is shown in Fig. 11. To obtain line and dump, the function ST_ClosestPoint is invoked with the trajectory and the red point shown in the figure. This returns the point within the trajectory, closest to the red one. With the function ST_LineLocatePoint, the percentage of the trajectory in which the mentioned point is found is obtained. By calling ST_LineSubstring with both points obtained, and the trajectory, the trajectory contained within both data points is retrieved. With line and dump, the MobilityDB function nearestApproachInstant is used to generate the tgeompoint, and thus obtain the map matched route, shown on the right-hand side of Fig. 11. It can be noted that now the pink route matches the original route spatially.

In conclusion, the results obtained with the simple map-matching solution implemented in MobilityDB are reasonably good. Furthermore, this approach is considerably much more efficient than the one followed by Barefoot, which uses a Hidden Markov Model map-matching implementation. In this particular case, the method yields better results because it takes advantage of the expected route data. Obviously, this functionality can be implemented in PostGIS but this would require to follow an approach similar to the one sketched in Sect. 3.2 to connect subsequent observations, and rewrite in SQL all the functionality natively provided by MobilityDB for computing the turning points of the distance function (e.g., function nearestApproachInstant above) between the planned and the actual trajectories.

5 Analysis and results

This section shows the use of MobilityDB for analyzing static and real-time public public transport in the city of Buenos Aires. First, GTFS Static data are analyzed and the results displayed by means of visualizations. Then, analyses that use both GTFS Static and Realtime datasets together are carried out. The section concludes with a discussion of the results.

To create the visualizations reported below, Grafana,Footnote 13 a web application that allows creating dashboards with data from multiple databases, is used. For the visualizations that contain maps, QGISFootnote 14 is used. The maps are obtained from OpenStreetMap.Footnote 15

5.1 GTFS Static

Figure 12 shows a portion of the trajectories for the real time bus feed. The trajectories are computed using MobilityDB, as the spatial projection of the continuous spatio-temporal trajectories produced from the schedules and stops data. From these spatio-temporal trajectories, interesting analyses can be performed, as shown next.

Fig. 12
figure 12

GTFS Static bus lines trajectories

Query 1

List the trajectories and their bus identifiers, that pass at a distance less than 200 m away from Colón Theater.

MobilityDB utility functions are used to compute and visualize the trajectories for all buses that pass close to a PoI, in this case, trajectories that pass at a distance less than 200 m away from the Colón Theater, a world-famous opera venue. Results are displayed in Figs. 13 and 14. Comparing the density of the trajectories displayed in Fig. 12, and considering that the Colón Theater is located downtown in the city, the radial design of the public bus system becomes evident: a large part of the lines go from the suburbs to downtown Buenos Aires.

Fig. 13
figure 13

Bus trajectories close to Colón Theater

Fig. 14
figure 14

IDs of bus lines that pass at a distance less than 200m from Colón Theater

figure k

Listing 2 depicts the query, which defines the table trip_distances, where the bus line and the shortest distance to the point of interest are stored. In order to calculate the latter value, MobilityDB functions are used, in particular shortestLine, which receives a tgeompoint and a geompoint, and returns the shortest line that connects the two figures. Online web tools are used to find the coordinates of the PoI (in SRID 5345). These coordinates are passed on to the function ST_MakePoint. The PostGIS function ST_Length retrieves the desired metric. The final SELECT statement simply provides the data in a format readable by Grafana. Since a single bus line may be associated to many trip_ids, the final output averages the minimum distance from all trips for the given bus line. Figure 14 also shows, using a red gradient, the relative average distances between each bus line and the theater.

5.2 GTFS Static and Real-time comparison

Figure 15 shows a portion of the positions of buses registered in the week of 18–25 August. By identifying the individual bus lines with the trip_id it is possible to query both, real-time and static feeds, for the different trips of a particular bus line, as shown in Fig. 16. Figure 17 shows the comparison for Line #152A, between real time and planned trajectories. It can be seen that both are similar.

Fig. 15
figure 15

Actual trajectory of buses

Fig. 16
figure 16

Comparing static and real-time feeds

Fig. 17
figure 17

Bus line #152A trajectory comparison

Queries on real-time data stored as continuous data can be efficiently and easily expressed using MobilityDB functions. An example is shown next.

Query 2

Compute the average speed of the vehicles by starting hour and day.

Listing 3 shows the MobilityDB queries that solve the problem.

figure l

With the trip field as a tgeompoint, the MobilityDB function twavg() computes the average speed of a trip. This function receives a list of numbers with a temporal value and computes the time-weighted average of these numbers. By using another MobilityDB function, speed(), in conjunction with twavg(), the average speed of the trips is obtained. Figure 18 displays the result, comparing real-time and static data. It can be noticed that the buses are moving considerably faster than in the planned itinerary. It also seems that the itinerary does not take into account the traffic congestion changes occurring during the day, while in the real-time data obtained, higher speeds are registered at midnight every day. The average speed of buses oscillates between 18km/h and 27km/h. When grouped by day, in both real-time and itinerary data, the average speed rises on weekends, and remains constant on weekdays. This is expected, since there are less commuters on weekends that may create traffic jams and congestion in the streets (see Agarwal 2004).

Fig. 18
figure 18

Average speed grouped by hour and date comparison

Another query, relevant for traffic analysis is shown next.

Query 3

Compute the average delay by bus line.

The query must compute the average delays, grouped by bus lines. This is shown in Listing 4 and results are depicted in Fig. 19.

figure m
Fig. 19
figure 19

Average delay by bus line

MobilityDB’s function timespan computes the duration of the trip. Then, the query uses pure PostgreSQL syntax. By calculating the difference between the duration of both tgeompoints it is possible to compute the delay for the whole trip, which is depicted in Fig. 19). It can be observed that only two lines have delays with respect to the itinerary. This is coherent with the speed data computed above. Results are expressed in minutes.

Query 4

Compute a heatmap comparing real and planned trajectories for different bus lines.

With the real-time and static tgeompoints, the trip_id is used to find both, the theoretical and real trajectories of every bus in the system. Taking advantage of MobilityDB’s functions to find regions of proximity between spatio-temporal objects, a delay heatmap for every bus in the system is built. Listing 5 shows a portion of the code for this, that is, the queries for the definition and population of the heatmap tables. The result is depicted in Fig. 20 for two different lines. Segments of the route where both, the realtime and static buses are close to each other are painted green, as the theoretical and real buses move away from each other the segment changes color towards red. Using MobilityDB syntax the segments for which the real-time and static buses were near each other up to a given tolerance are selected. The function tdwithin generates a continuous boolean temporal type which has value TRUE when the temporal points are within a given distance from each other. Combining this with the atPeriodSet function it is possible to discard all segments from the tgeompoint where the trips are farther away than the given tolerance. The WHERE clause allows improvement of the performance of the query by taking advantage of the topological operator && (overlaps) which launches an index search.

Five tables are created, each one containing the trip segments for different degrees of tolerance. Once visualized in an application such as QGIS, the stepped-tolerance creates a heatmap-like visualization. These visualizations are shown in Figs. 20 and 21.

figure n
Fig. 20
figure 20

Heatmaps for bus lines #271P (left) and #80B (right)

Fig. 21
figure 21

Heatmaps for bus line 152A

As a final discussion, many visualizations confirm significant differences between the real location of the buses and their static itineraries. These differences are to be expected by any person who uses public transport in her daily life. However, it is interesting to observe that in the cases reported here, these differences are not delays but, on the contrary, the real bus time is ahead of its itinerary. In any other year this observation would most likely lead to conclude that the data are erroneous. However, due to the extremely unusual events that have taken place in 2020, it is believed that the cause for this result is different. The sampled data corresponds to August in Argentina, when the city of Buenos Aires and its outskirts were going through a strict lockdown due to the SARS-CoV-2 pandemic, except for public transportation. Many studies have already been analyzing the effects of the lockdown on traffic and circulation. Amongst them are those given by the Google Mobility Report, shown in Fig. 22. It can be observed there, that the results obtained in the use case reported here are explained by the traffic reduction in the streets of Buenos Aires.

Fig. 22
figure 22

Google Mobility Report Buenos Aires 25-Aug-2020

6 Conclusion

This work shows how MobilityDB, an open-source moving object database, can be used to analyze mobility data, in particular, public transport data. For this, the city of Buenos Aires, Argentina, is used as a case study. Public data about bus, train, and subway schedules are available compliant with the GTFS Static standard. Further, real-time data for buses are also available. These data are captured, preprocessed and loaded into a PostgreSQL database provided with the PostGIS and MobilityDB extensions, and used to analyze schedules, itineraries, delays, and other typical questions of interest for transport planners. The processes carried out and software developed for all the tasks involved, like capturing data, are also discussed (software is also publicly available). Further, a novel map matching method is also proposed, that uses MobilityDB to match real-time trajectories to roads using the planned trajectories, and considering that the latter represent the actual roads to which the former ones must be matched.

The processes described in this paper cannot be replicated and generalized for importing any GTFS dataset into MobilityDB, because of the wide spectrum of formats that the GTFS standard allows. However, this study is a valuable reference for similar cases. The queries displayed in the analysis section illustrate how transit data can be queried in MobilityDB, and displayed using visualization tools. Although Grafana was used in this paper, any other similar tool could be used to visualize the results.

MobilityDB is continuously evolving in several directions. In particular, to be able to process the massive amounts of movement data that are currently being generated, a distributed version of MobilityDB that works in cloud environments such as Azure, AWS, or Google Cloud Platform, is under development. As another direction, GTFS Static represents periodic movement data that are valid during a certain time interval (Behr et al. 2006). The current approach used in this paper requires to “instantiate” these periodic data to represent each actual occurrence. For instance, a service occurring every Monday at 8 am will be replicated for each Monday in the period of analysis. To avoid this, it is planned to implement in MobilityDB a new data type to account for periodic movements.