Public Transport Commuting Analytics: A Longitudinal Study Based on GPS Tracking and Unsupervised Learning

This paper uses unlabelled GPS tracking data collected by a smartphone application, enriched by fusion with automatic vehicle location (AVL) data, to study commuting trips from home to work and vice versa. Such commuting trips play a significant part in public transport (PT), and in transport planning in general. This work investigates patterns of mobility, based on several thousand recorded trips over a set of users in a longitudinal study, by first applying unsupervised clustering algorithms to impute work and home locations and then analysing relevant characteristics, such as departure times, mode/line choice and trip duration. Finally, a heuristic algorithm is proposed to analyse the extent and frequency of similar trips. The results quantify the extent and the limits of the regularity of individual commuting behaviour in terms of repeatable travel choices. Commuters are quite consistent in their choices of departure times and lines used, even though differences are found between the two directions of the commuting trips, with work-home trips having a greater average duration and, in many cases, different choices of lines.


Introduction
Investigating commuting patterns is a key element for enhancing urban networks, as they reflect the long-term and usual behaviour of individuals and have a considerable impact on human mobility (Kung et al. 2014). Studies on car, public transport (PT) and, more recently, bike commuting provide critical insights into how individuals move within and between urban environments, with significant implications for urban planning, transportation infrastructure, and sustainability. For example, it is well known that efficient and reliable PT is an effective tool to mitigate traffic congestion and reduce emissions, whilst significantly reducing car dependency (Ma et al. 2017) and the pressure on scarce road space. While high-quality PT links to employment centres are shown to encourage switches away from car commuting (Clark et al. 2016), planners and operators must identify opportunities to keep PT an attractive alternative for the population, as its operations are strongly constrained by phenomena at peak hours, mostly due to commuting. On the other hand, car commuting has implications beyond road traffic congestion and high emissions: it also has spatial implications for urban centres by reducing walkability, transit accessibility, and community interactions, as car-oriented urban design often prioritizes vehicle movement over pedestrians and cyclists (Boulange et al. 2017). In this sense, measuring the regularity of individual travel behaviour over time, in terms of discrete and repeatable choices, is key to understanding travellers' decision-making process, enabling service providers to enhance customer analytics and mobility prediction (Goulet-Langlois et al. 2017).
Behavioural studies have traditionally been conducted through surveys, either stated preference (SP) or revealed preference (RP). Both approaches are known to have limitations: in SP, because the users are not experiencing the trip themselves and the survey relies on stated intentions instead of longitudinally revealed behaviour; in RP, because the path sampling process is very complex, causing the few surveys reported to be useful only for long-distance travel situations (de Freitas et al. 2019). Many studies use individuals' socioeconomic and sociodemographic characteristics acquired through questionnaires and microcensus to determine a mobility profile; de Freitas et al. (2019) summarize some relevant works. However, although these studies with categorical data exist, they do not exactly reveal the behaviour of a random user confronted with many possible route choices and, more importantly, they are usually constrained to small sample sizes and short observation periods (Goulet-Langlois et al. 2017). Moreover, these surveys are costly, time-consuming and frequently result in low sampling rates (Ma et al. 2017). Recent developments in technologies to acquire data, as well as emerging statistical methods for the analysis of such data, allow better investigation of individual commuting behaviour, which is crucial for achieving the full potential of intelligent transportation systems. New big data sources allow a more detailed analysis of vehicle, transit, bicycle, and pedestrian trips than ever before (Griffin et al. 2020). For example, Global Positioning System (GPS) tracking data is a low-cost and efficient alternative to these traditionally used travel surveys. New technologies allow collecting such data automatically, for instance through smartphone applications that significantly reduce the burden placed on users, with low battery usage, satisfactory spatio-temporal precision and restricted (or absent) user interaction in the form of manual inputs (Marra et al. 2019).
A growing number of studies have been published in recent years as a result of technologies that allow massive amounts of data to be collected. Automated fare collection (AFC) data and GPS records are amongst the most reported data types for studying mobility analytics in PT, whereas automatic vehicle identification (AVI) data, automatic license plate recognition (ALPR) data, GPS records and mobile phone data, such as call detail records (CDRs), are those reported for investigating private vehicle commuting behaviour. However, as highlighted by Kim et al. (2017), the literature still lacks studies investigating how the individual trip decision-making process translates into spatial and temporal patterns across a transit system. In the case of commuting behaviour, for example, most works define commuting only in terms of repeatability of temporal activities (e.g. users travelling four days or more per week are considered commuters) and only a few works model commuting behaviour in terms of both spatial and temporal regularities (Ma et al. 2017). On top of that, studies combining private (car, bike) and PT commuting are scarce. The main reason is that only a few data sources can track users in both PT and private modes of transport. With GPS, it is possible to study trips that are made sometimes by PT and sometimes by private mode. Furthermore, GPS also reveals very precisely the origin and destination points (OD pair) of trips, which cannot be achieved with other commonly used data sources, such as AFC data that only include the boarding (and sometimes alighting) locations.
This paper investigates repeatability and, in general, patterns of demand for public and private transport users commuting between home and work locations. In particular, the present study defines commuting in terms of trips starting and ending at specific locations (home and work) and serving the same purpose (commuting to work from home, and vice versa), and it aims to measure the regularity of individual commuting travel behaviour over time, by considering repeatable choices of three main parameters, namely mode/line choice, departure time and duration. Studying variability in travel behaviour is fundamental to a better understanding of mobility patterns and, in particular, travel demand variability. GPS tracking data (which provides very low-level information, and comes with relatively low quality and a non-systematic sampling rate), fused with AVL (automatic vehicle location) data (which describes the actual PT supply), is a promising solution to expand and develop such studies.
This research is motivated by gaps identified in the existing literature (e.g. Gärling and Axhausen 2003; Kim et al. 2017; Levinson and Zhu 2013; Lima et al. 2016; Ma et al. 2013, 2017; Schlich and Axhausen 2003), which supports that mobility patterns and route choice behaviour are somehow regular for commuters. We limit the analysis to such commuting trips, where trip purpose and location are inferred, and other attributes linked to repeatability are studied, such as departure times, mode/line choices and trip duration. Specifically, the paper adds to the state of the art as follows: (i) it proposes an unsupervised approach to distinguish between commuting (trips from home to work, and vice versa) and non-commuting trips. This approach has a high degree of flexibility and is able to identify non-standard and low-frequency commuters, such as those doing night shifts or those commuting only a couple of days per week; (ii) it summarizes the main behavioural aspects and characteristics of travellers in commuting trips and, whenever applicable, compares the differences with non-commuting trips in the context of system-wide patterns; (iii) it identifies, for each traveller over time, a subset of commuting trips which are characterised by the (simultaneous) regularity of mode/lines, departure times and duration. Therefore, it reveals the frequency of similar trips for each user. The dataset comprises 6838 trips of 172 users in the city of Zürich (Switzerland). The application allowed (continuous) passive tracking, and activities, trips and modes were identified through a mode detection algorithm, as described in Marra et al. (2019). The paper continues with a literature review. Then, Sect. "Methodology" reports the methodology. Section "Behavioural Aspects of PT Commuting Trips" analyses, separately and together, multiple aspects of repeated trips. Section "Conclusions and Future Research" concludes the paper.

State of the Art
Most previous works on commuting patterns focus on either private vehicle traffic (including ride-sharing), for which data is more accessible (Zhao et al. 2019), or solely on PT or bike/e-bike commuting. The main reasons lie in the data source type (which may not allow for collecting data on the different transport modes) and the particular objectives of these studies. For example, Hong et al. (2020) and Zhao et al. (2019) use automatic vehicle identification (AVI) and automatic license plate recognition (ALPR) data, respectively, to investigate the commuting behaviour of private vehicle travel in different cities in China, whereas Li et al. (2005) use GPS data and a logit model to investigate the morning commute. Mobile phone data, such as call detail records (CDRs), is also reported (see, e.g., Kung et al. 2014), although it is usually "car-heavy". On the PT side, AFC data is the most reported data source to study trip regularity, as it allows for massive amounts of data to be collected and it is suitable to investigate standard, regular commuting trips. For example, Ma et al. (2013, 2017) propose a series of data mining methods using smart card data to cluster, for each traveller, the set of most frequent commuting trips to investigate spatial and temporal travel patterns in Beijing. Zhou et al. (2014) use linear programming to investigate commuting efficiency along the bus network in the Beijing metropolitan area by combining smart card and travel survey data. Also on commuting patterns of bus riders and AFC data, Kim et al. (2017) introduce a metric called "stickiness index" to classify users according to the regularity with which they choose their routes, which relates to the frequency of similar routes. Kusakabe and Asakura (2014) use smart card data with the aim of analysing behavioural features to classify trip purpose utilising a naive Bayes classifier. Ortega-Tong (2013) also explores the similarity in travel patterns from riders with smart cards combined with socio-demographic characteristics to identify clusters with similar structures. Goulet-Langlois et al. (2016) propose the clustering of PT stops and stations to infer activities based on likely activity locations.
However, the main challenge (and drawback) of AFC data is that, in general, only boarding (and sometimes alighting) locations are known, but not the passenger's full itinerary. As highlighted by Berggren et al. (2021), the challenge of obtaining full itineraries is usually approached by some kind of mirroring of the boarding profile, although the methods for identifying potential transfers and other activity destinations differ, often making use of different sources of data. For example, Farzin (2008) uses three data sources (onboard AVL transmissions, geographic bus stop location data, and AFC smart card boarding data) to construct a bus OD matrix and infer the final destination zone, while Trépanier et al. (2007) rely on AFC smart card boarding data and passenger counting system data to obtain OD matrices. Barry et al. (2009) use smart card AFC data to infer the traveller's complete itinerary using a schedule-based shortest-path algorithm. All these studies, however, rely on a couple of conservative assumptions: (i) for a trip, users return to their previous trip's zone/destination station; and (ii) the last trip of the day always ends in the same zone/station as the origin of the first trip of the day (the home zone), regardless of how many trips were taken during the day. Nevertheless, even such complex procedures and assumptions could only guarantee approximate locations and OD matrices, and they remained subject to errors and inaccuracies. For example, Trépanier et al. (2007) reported a 66% success rate (80% for peak hours) for their method to infer alighting locations of bus stops. Furthermore, the process of distinguishing between transfers and activities in trip itineraries is challenging, and studies based on AFC data cannot be linked easily with activity locations.
Some studies reinforce the importance of the spatial/temporal features chosen in terms of passenger behaviour. Carrel et al. (2013) conclude that departure regularity is the most important path-specific feature, and they show that PT passengers gradually adapt upon changes to departure reliability and headway lengths, although for short headways this adaptation consists mostly of a stochastic behaviour. In terms of comparable trips across days and weeks, Trépanier et al. (2007) present a study with smart card data showing that these comparable trips are regularly made from Monday through Thursday, are of the same distance and have the same direction. Goulet-Langlois et al. (2017) hypothesize that the order in which an individual engages in trips and activities is an important characteristic of travel behaviour, so they propose an approach to measure the regularity of travel behaviour based on the order in which travel events are organised over time in travel sequences. Their conclusion is that travel regularity may follow atypical patterns which are not captured by either periodicity-based methods or activity-based models. Ma et al. (2013) use DBSCAN, a density-based clustering algorithm, to investigate spatial and temporal travel patterns in Beijing, and then utilise a K-means++ algorithm and a Rough Set-based approach to measure travel regularity. The clusters are formed based on regularity attributes for each traveller, namely number of travel days, number of similar first boarding times, number of similar route sequences and number of similar stop ID sequences. The study, however, does not contemplate full trip itineraries or trip purpose and the only temporal attribute considered is the number of travel days. Still investigating travel patterns in Beijing, Ma et al. (2017) find that the majority of commuters depart around morning peak hours (7:00-9:00) and return during evening peak hours (17:00-19:00), whereas a clear pattern is not observed for non-commuters. They also observe that commuters have a higher number of travelling days than non-commuters and are less likely to opt for PT options when the distance between residence and workplace is large.
All these studies take advantage of the great amount of data available and apply methods that enable inference about travellers' behavioural characteristics. This work differs from the others by presenting a methodology using complete trip itineraries obtained from GPS tracking data, and addressing the lack of trip purpose by using an unsupervised clustering approach to infer home and work locations. An important characteristic of our method is the high degree of flexibility of the clustering algorithm, which is able to identify non-standard and low-frequency commuters, such as those doing night shifts or those commuting only a couple of days per week. Unlike most studies, we make inference on complete trip itineraries, and consider all users' trips, whether made by public transport or private mode (car/bike). Most previous works in the field use smart card data. However, the lack of a complete itinerary, the uncertainties around the origin and destination locations and the ignorance of activities happening before/after cause important indicators either not to be computed or to be computed with potentially large errors; this could completely change some statistics and conclusions (Trépanier et al. 2007). Compared to smart card data, GPS tracking data, nowadays mostly acquired through smartphone applications (Cottrill et al. 2013), is a better alternative as it can capture all the user's movements throughout the day, thus allowing a better understanding of day-to-day variability. However, it is much more low-level, as it reports only locations, sampled irregularly and with errors that vary over users, time, and space. This low quality of data must be compensated for with the usage of memory (i.e. aggregating nearby GPS locations into activities or trips) and fusion with different data. To this end, we determine multiple imputation of labels, and exploit fusion with high-quality AVL data (which describes the actual supply).
Commuting variability is herein studied to derive aspects of regularity in commuting behaviour using descriptive statistics in terms of three main commuting parameters (departure times, mode/line choice and trip duration). The frequency of similar trips for each user, an aspect often ignored in the literature, is also considered through a simple grouping heuristic based on the same parameters utilised to analyse commuting patterns, which makes it possible to understand which level of similarity is realized among commuting trips for the same individual.
The next sections utilise the trips identified by the mode detection algorithm in the ETH-IVT Travel Diary survey. The mode detection algorithm described in Marra et al. (2019) derives travel diaries from GPS data. The algorithm identifies activities (done in a single location) and trips (movements between activities) for each user. Afterwards, each trip is divided into stages, and the mode for each stage is identified (walk, private or public transport). For each public transport stage, the algorithm also identifies the vehicle, line, departure and arrival stops and times. Without going into the details, the main criteria to detect the mode are a low speed to identify walks, and a comparison with AVL data to identify public transport stages. By comparing the path of a public transport vehicle (from the AVL data) with the path of a user (from GPS data), it is possible to detect the vehicle used. Finally, a private stage is identified by exclusion, and can be further distinguished into car or bike, if needed. Overall, the mode detection algorithm has an average accuracy of 86.14% and has been validated in previous works on the same dataset (Marra and Corman 2020; Marra et al. 2022), identifying realistic mode share and estimation of route choice models.

Methodology
A crucial aspect of the ETH-IVT Travel Diary application was related to reducing the burden placed on users. An evident downside was that labelled data (such as user confirmation of modes of transport and activities for each trip) was not available. Hence, the first part of this methodology consists of imputing some geographical characteristics of the activities and trips, and the locations where activities seem to be performed more often.
Although the information on routes and the relevant modes of transport can be obtained by appropriately matching GPS time and position data with real-time public transport information (e.g. AVL data), classifying activities using only spatio-temporal data is challenging. Even some traditional unsupervised machine learning techniques, such as clustering, may require information not readily (or ever) available to the analyst for proper classification of most activities since they are user-specific. For example, a patient may visit a hospital frequently for treatment, whereas a nurse may commute daily to the same hospital for work. Accurately identifying these differences is a challenge for unsupervised learning. Assuming that no other information is available, clustering can be performed based on spatial and/or time coordinates, grouping together activities that are located close to each other and/or realized at approximately the same time. One such algorithm is DBSCAN (Ester et al. 1996), which uses the GPS coordinates along with thresholds for the maximum distance $\varepsilon$ (the radius of the circle formed with a point in its centre) and a minimum number of points ('MinPts') to cluster points together. Similar applications have pointed out the good suitability of the method for this purpose (Bhadane and Shah 2020; Liu et al. 2019; Marra et al. 2022; Xiong et al. 2020).
DBSCAN is based on the concept of density connectivity, which efficiently classifies points in clusters of arbitrary shape, without the need to specify an initial number of clusters. The density of an arbitrary point is defined as the number of points within a circle of radius $\varepsilon$ from that point. Then, for each point in the formed clusters, the circle of radius $\varepsilon$ contains the minimum number of points specified. Two main definitions are important to form the density-based notion of a cluster (Ester et al. 1996): density-reachable and density-connected points. A point $p$ is density-reachable from a point $q$, with respect to $\varepsilon$ and 'MinPts', if there is a chain of points $p_1, \ldots, p_n$, with $p_1 = q$ and $p_n = p$, such that $p_{i+1}$ is in the $\varepsilon$-neighbourhood of $p_i$ and the number of points in that neighbourhood is greater than or equal to 'MinPts' (in this case $p_{i+1}$ is also called directly density-reachable from $p_i$, and $p_i$ is a core point). If there is a point $o$ such that both $p$ and $q$ are density-reachable from $o$ with respect to $\varepsilon$ and 'MinPts', then $p$ and $q$ are said to be density-connected with respect to $\varepsilon$ and 'MinPts'. Density connectivity is a symmetric relation so that two points (called border points) can belong to the same cluster even without sharing a common core point, but then it must be the case that there exists a common core point from which these border points are density-reachable.
The distance parameter $\varepsilon$ plays an important role in the DBSCAN algorithm since it defines a maximum distance threshold for activities to be considered in the same location (hence, classified as the same activity). Intuitively, this threshold should be high enough to accommodate possible GPS location errors as well as locations that are large in size, but not so high that it includes neighbouring locations corresponding to other activities, which could cause, for instance, a supermarket in the neighbourhood to be labelled as 'home'. In this study, $\varepsilon$ was set to 100 m, after experimenting and analysing the outcomes for several values.
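To make the role of the distance threshold concrete, the following is a minimal sketch of DBSCAN applied to activity locations, written in plain Python rather than taken from the study's implementation. The 100 m radius comes from the text; the value `min_pts=3`, the haversine distance, the brute-force neighbour search and the sample coordinates are assumptions made only for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def dbscan(points, eps_m=100.0, min_pts=3):
    """Return one label per point: 0, 1, ... for clusters and -1 for noise.

    eps_m=100 follows the text; min_pts=3 is an illustrative assumption.
    """
    n = len(points)
    # eps-neighbourhood of every point (the point itself included), O(n^2).
    neighbours = [
        [j for j in range(n) if haversine_m(points[i], points[j]) <= eps_m]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_pts:
            labels[i] = -1  # provisional noise; may turn into a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, not expanded further
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_pts:  # j is a core point: keep expanding
                queue.extend(neighbours[j])
    return labels


# Hypothetical activity locations: two dense spots and one isolated visit.
activities = [
    (47.3779, 8.5403), (47.3780, 8.5404), (47.3778, 8.5402),
    (47.4111, 8.5441), (47.4112, 8.5441), (47.4110, 8.5442),
    (47.3500, 8.6000),
]
print(dbscan(activities))
```

With these sample points, the two dense spots form separate clusters and the isolated visit is left as noise. A larger radius would eventually merge nearby locations, which is the mislabelling risk described above.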
Although DBSCAN automatically defines the number of clusters based on the specified inputs $\varepsilon$ and 'MinPts', one of its biggest drawbacks lies in the fact that all the clusters are based solely on these two parameters, so if the data has points forming clusters of varying densities, the resulting clusters may be meaningless. In this case, a trade-off between accuracy and detail of activities is necessary for the application of GPS tracking data. This trade-off accepts the unknown nature of the activities but assumes some common behavioural patterns, such as the ones involving home and work locations. A simple rule of thumb is to define the cluster representing 'home' as the one with the most points (i.e. the location with the highest number of activities), the cluster representing 'work' as the second most visited one, and all other clusters just being assigned the label 'other'. Temporal information can also be included, as in Marra et al. (2022), where 'home' and 'work' (or, equivalently, 'school' in the case of students) classifications were constrained to the clusters with the highest number of activities during nighttime and daytime, respectively, and only on weekdays. However, one of the goals of this study was to capture non-standard commuting trips (e.g. night shifts, unusual departure times, etc.); hence, constraining classifications to be within regular work shifts did not meet the requirements for this study. To overcome this, the most and the second most visited locations according to DBSCAN were restricted to locations with a minimum weekly visiting frequency of 2 days a week and a minimum activity length of 4 h. Then, given these conditions, the 'home' location is the most visited cluster and 'work' the second most visited, while all other clusters are labelled as 'other'. Under this scheme, work locations visited with low frequency can be included, but an activity duration of at least four hours is required. This prevents, for instance, frequent activities, such as shopping for groceries or going to the gym, from being mistakenly taken as 'work'. These numbers were tuned after running the algorithm with different values and after comparison with the information provided in the qualitative questionnaires, taken at the beginning of the survey. Further information and some descriptive statistics on the survey questionnaire can be found in Appendix 6.
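The labelling rule can be sketched as follows. This is an illustrative reading rather than the study's code: the thresholds (2 days per week, 4 h) come from the text, while the input layout, the function name `label_locations`, and the way the weekly frequency is computed (distinct days with a qualifying activity divided by the number of observed weeks) are assumptions.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

MIN_DURATION = timedelta(hours=4)  # minimum activity length (from the text)
MIN_DAYS_PER_WEEK = 2              # minimum weekly visiting frequency (from the text)


def label_locations(activities, n_weeks):
    """Label each cluster of one user as 'home', 'work' or 'other'.

    activities: iterable of (cluster_id, start, end) with datetime bounds.
    n_weeks: length of the observation period in weeks (assumed known).
    """
    visits = Counter()        # qualifying activities per cluster
    days = defaultdict(set)   # distinct days with a qualifying activity
    clusters = set()
    for cluster_id, start, end in activities:
        clusters.add(cluster_id)
        if end - start >= MIN_DURATION:
            visits[cluster_id] += 1
            days[cluster_id].add(start.date())
    # Keep only clusters visited often enough, most visited first.
    eligible = [c for c in visits if len(days[c]) / n_weeks >= MIN_DAYS_PER_WEEK]
    eligible.sort(key=lambda c: visits[c], reverse=True)
    labels = {c: "other" for c in clusters}
    for c, name in zip(eligible, ("home", "work")):
        labels[c] = name
    return labels


def stay(cluster_id, day, hour, hours):
    """Hypothetical helper: one activity in March 2021 starting at `hour`."""
    start = datetime(2021, 3, day, hour)
    return (cluster_id, start, start + timedelta(hours=hours))


# One observed week of a hypothetical night-shift worker.
week = (
    [stay(0, d, 8, 11) for d in range(1, 6)]    # daytime stays at cluster 0
    + [stay(1, d, 21, 8) for d in (1, 3, 4)]    # three night shifts at cluster 1
    + [stay(2, d, 19, 1) for d in range(1, 6)]  # daily one-hour gym visits
)
print(label_locations(week, n_weeks=1))
```

In this example the night shifts are still recognised as 'work' because no time-of-day constraint is used, while the frequent but short gym visits remain 'other'.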
After labelling locations, it was possible to identify all 'home-work' and 'work-home' trips, as well as 'other' trips.For studying commuting, we focus on the 'home-work' and 'work-home' trips, and take all other trips as non-commuting trips.
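As a small illustration of this step, the sketch below classifies trips by the labels of their origin and destination and computes a user's share of commuting trips. The function names and the input layout (pairs of location labels) are hypothetical.

```python
def trip_type(origin_label, destination_label):
    """Map a pair of location labels to 'home-work', 'work-home' or 'other'."""
    pair = (origin_label, destination_label)
    if pair == ("home", "work"):
        return "home-work"
    if pair == ("work", "home"):
        return "work-home"
    return "other"


def commuting_share(trips):
    """Fraction of a user's trips that are commuting trips.

    trips: list of (origin_label, destination_label) pairs.
    """
    if not trips:
        return 0.0
    commuting = sum(trip_type(o, d) != "other" for o, d in trips)
    return commuting / len(trips)


# Hypothetical day with a stop on the way back from work.
day = [("home", "work"), ("work", "other"), ("other", "home")]
print([trip_type(o, d) for o, d in day], commuting_share(day))
```

Note that a stop on the way home splits the return leg into two 'other' trips, so only the outbound leg counts as commuting.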
In the next sections, the word recurrence is adopted to indicate a repeated trip (or a trip aspect/characteristic), as it is a common term in the literature. Studying the recurrence of routes and, particularly, the differences between choice patterns in commuting and non-commuting trips helps draw user profiles or, in other words, facilitates the understanding of what most travellers are looking for when choosing their daily routes, and which factors, if any, are somehow relaxed when the trip purpose is not directly associated with work. The recurrence is first investigated from three perspectives: departure time, mode/line choice and trip duration. Classical descriptive statistical methods are used for these analyses. They reveal interpersonal variability.
Then, the frequency of similar commuting trips, for each individual with commuting trips in the study, is computed by considering a heuristic grouping algorithm that clusters trips sharing similar feature values (in terms of the three main factors studied before). This analysis reveals the proportion of these trips that can be assumed equivalent, and/or which level of similarity is realized. Hence, the interest shifts to knowing whether, among all commuting trips for a given user, the observed choices were simultaneously recurrent. This reveals intrapersonal variability.
For evaluating the similarity of the trips, a similarity metric is proposed based on departure times, trip duration and line choice (for private trips, line is taken as 'private', and for PT trips, line is the specific label of the line(s) taken). A heuristic algorithm (reported as pseudocode in Algorithm 1) first groups, for all the commuting trips of a given user, the trips that are made with the same lines, until all trips are assigned to a cluster (if no matches exist, the trip forms a cluster of size one). Then, for each cluster previously formed (a total of $I_u$, for each user $u$), new (sub)clusters are formed based on departure times and trip duration. More specifically, taking one variable at a time, trips are grouped together if they are within 5 min of each other or accessible from at least one common trip. For example, a trip with a departure time at 8:00 am and another trip with a departure time at 8:10 am are initially not grouped together, but if there exists a trip with a departure time equal to 8:05 am, for example, then the 8:10 am trip is accessible from the 8:00 am trip through the 8:05 am trip, so the three of them are grouped together. The "accessibility" time rule, which was inspired by the DBSCAN algorithm, is applied to both trip departure and trip duration variables, one variable at a time. The output is a cluster/subset of trips, for each user, which can be considered similar from the perspective of lines, departure times and duration. The similarity metric corresponds to the ratio of the number of trips in this final cluster over the total number of trips (either home-work or work-home) for that user.
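The grouping heuristic described above can be sketched as follows. This is not a transcription of Algorithm 1: the trip layout (dictionaries with `lines`, `departure_min` and `duration_min`), the function names, and the choice of the largest subset as the "final cluster" for the similarity metric are assumptions. In one dimension, the "accessibility" rule amounts to sorting the values and splitting wherever two consecutive values are more than 5 min apart.

```python
from collections import defaultdict


def chain_groups(trips, key, gap=5.0):
    """Group trips whose sorted `key` values are chained by steps of at most `gap`."""
    groups = []
    for trip in sorted(trips, key=key):
        if groups and key(trip) - key(groups[-1][-1]) <= gap:
            groups[-1].append(trip)  # reachable from the previous trip
        else:
            groups.append([trip])    # gap too large: start a new group
    return groups


def similar_trip_subsets(trips, gap=5.0):
    """Split one user's commuting trips by lines, then departure time, then duration."""
    by_lines = defaultdict(list)
    for trip in trips:
        by_lines[trip["lines"]].append(trip)
    subsets = []
    for same_lines in by_lines.values():
        for by_departure in chain_groups(same_lines, lambda t: t["departure_min"], gap):
            subsets.extend(chain_groups(by_departure, lambda t: t["duration_min"], gap))
    return subsets


def similarity(trips, gap=5.0):
    """Share of trips in the largest subset (assumed to be the 'final cluster')."""
    if not trips:
        return 0.0
    largest = max(similar_trip_subsets(trips, gap), key=len)
    return len(largest) / len(trips)


# Hypothetical home-work trips of one user (times in minutes after midnight).
trips = [
    {"lines": ("10",), "departure_min": 480, "duration_min": 25},       # 8:00
    {"lines": ("10",), "departure_min": 485, "duration_min": 27},       # 8:05
    {"lines": ("10",), "departure_min": 490, "duration_min": 24},       # 8:10
    {"lines": ("10",), "departure_min": 540, "duration_min": 26},       # 9:00
    {"lines": ("private",), "departure_min": 480, "duration_min": 20},
]
print(similarity(trips))
```

Here the 8:00, 8:05 and 8:10 trips on the same line are chained together, as in the example in the text, while the 9:00 trip and the private trip each form their own subset.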

Behavioural Aspects of Commuting Trips
To extract the commuting trips from unlabelled GPS data, the clustering strategy discussed in the previous section is used. After labelling locations, commuting trips from home to work and vice versa are obtained by identifying trips with origin and destination points in these locations. For visualization purposes, Fig. 1 shows two panels where each dot is one user coloured according to their commuting proportion. On the left side (yellow-red scale), the dots represent the home locations of the users, and on the right side (blue-red scale), the dots represent the work locations. The scale in the right panel has been slightly adjusted to enable showing users with work locations far from Zürich's city centre (Zürich HB). It is interesting to notice that there is a concentration of work locations in the city centre, but there are some occurrences very far from Zürich (e.g. close to Rapperswil). For the home panel, although there is a big cluster in the city centre, we can see a concentration around residential neighbourhoods, such as Oerlikon (about 2.5 km away from Zürich's main station). The majority of users have a proportion of commuting trips in the range 10-30%, and Fig. 1 does not indicate a strong spatial commuting pattern in any specific area, although it suggests a lower proportion of commuting (blue circles in the right panel) for work locations that are far from the city centre. Table 1 summarizes the results obtained for activities and trips using this clustering strategy, and it also highlights the number of private, PT and mixed (private + PT) trips.
Algorithm 1 Create subsets of commuting trips based on the similarity metric
Table 1 reveals that about 16% of the trips are commuting trips from home to work, and vice versa, with a clear imbalance between them: home-work trips account for about 58% of these commuting trips, indicating that some trips departing from work do not have home as their destination. This points to more complex trip chaining on the work-home trip. It is interesting to notice that commuting trips, on average, take longer than the average of all trips (29.46 min vs. 24.20 min, respectively), suggesting that many other activities are performed close to the home or work locations. In particular, work-home trips are, on average, longer than home-work trips (30.39 min vs. 28.82 min, respectively). Since the clustering distance parameter was chosen to be restrictive (100 m), it is unlikely that the explanation for the difference comes from stops at other locations in the neighbourhood that are yet farther away than home. Instead, two other hypotheses seem more plausible. The first one is that traffic congestion at the end of the working hours (evening rush hours) is more intense, making work-home trips longer than home-work trips. In fact, according to traffic flow statistics in the city of Zürich (TomTom 2021), the weekly average congestion levels are slightly worse in the evening rush hours (from 16:00 to 18:00) than in the morning rush hours (from 6:00 to 8:00), with an extra time spent driving estimated as 13 min (for every 30 min driving) in the morning rush hour vs. 17 min in the evening rush hour, a 31% increase. Although this does not exactly reveal to what extent PT traffic is affected, further analyses on walking times, transfer times and times on the PT network can provide some insights on whether route choices of work-home trips are different from those of home-work trips. This study, however, does not contemplate a causality analysis between traffic awareness and route choice. Instead, it assumes that, since the trips reflect recurrent choices of the travellers, aspects related to usual levels of traffic and crowding are known, and thus, usual route choices are made accordingly. In general, both of these insights identify some limits of the generally applied approach of mirroring the home-work/work-home trips when data about one of the two is missing or inconclusive (for instance, based on AFC without tap-out).
The second hypothesis, a seemingly intuitive explanation, is that commuters do not always aim for the shortest trips when coming back home from work and, for instance, may prioritize other aspects related to comfort, or even longer walks, resulting in longer travel times. In fact, previous research (Li et al. 2005) has shown that commuters have more flexibility in terms of departure times and route choice in the evening than in the morning commute. It is also plausible that commuters perceive work-home trips differently. For example, a study by Jenelius (2018) for a high-frequency bus line in Stockholm has shown that passengers perceived journey time differently (and notably higher) than nominal journey time, particularly during the afternoon peak, where in-vehicle crowding was significant and waiting times relatively long due to large headway variability. To evaluate the plausibility of these hypotheses, and to investigate other aspects of commuting trips, in particular travellers' behaviour when commuting, the next subsections present several analyses made on the ETH-IVT Travel Diary dataset.

Recurrence in Departure Times
Figure 2 shows the histogram of deviation in minutes from average departure times, where each bin corresponds to five minutes. To calculate the deviation, for each user with at least one trip in the category "home-work" or "work-home", the average departure time is determined considering all trips of that user in that category, and then the difference (in minutes) between each trip and the average time is calculated and plotted. To avoid possible biases due to, for example, an individual who goes back home every day for lunch and returns to work afterwards, three shifts were considered: morning (from 6:00 to 12:00), afternoon (from 12:00 to 18:00) and evening (all other times). In this case, averages were taken for each shift, and differences were calculated based on these averages. The appropriateness of the shifts can be easily assessed by checking the ranges of the values in each category. The range for home-work trips was [−140, 245] minutes and for work-home trips it was [−246, 231] minutes; in both cases every deviation is below five hours in absolute value, which is less than the length of the intervals proposed for the shifts.
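The deviation computation described above can be sketched as follows. The record layout (user, direction, departure time as minutes since midnight) and the half-open treatment of the shift boundaries are assumptions made for illustration. Note also that the evening shift wraps around midnight, so a plain average of minutes since midnight is only a rough proxy for that shift.

```python
from collections import defaultdict


def shift_of(minute_of_day):
    """Map a departure time (minutes since midnight) to one of the three shifts."""
    if 6 * 60 <= minute_of_day < 12 * 60:
        return "morning"
    if 12 * 60 <= minute_of_day < 18 * 60:
        return "afternoon"
    return "evening"  # all other times


def departure_deviations(trips):
    """Deviation of each trip from the mean departure time of its group.

    trips: iterable of (user, direction, minute_of_day) tuples (assumed layout).
    Returns a list of (user, direction, shift, deviation_in_minutes).
    """
    groups = defaultdict(list)
    for user, direction, minute in trips:
        groups[(user, direction, shift_of(minute))].append(minute)

    deviations = []
    for (user, direction, shift), minutes in groups.items():
        mean = sum(minutes) / len(minutes)
        deviations.extend((user, direction, shift, m - mean) for m in minutes)
    return deviations
```

Grouping by shift before averaging is what keeps a lunchtime return to work from pulling the morning average of the same user towards midday.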
Figure 2 makes evident the concentration of departure times around mean values (corresponding to the 0 mark on the x-axis) for both home-work and work-home trips. In particular, 37% of the home-work and 28% of the work-home trips occurred within 5 min of the average departure time when absolute differences are considered, and the percentages increase substantially when a 15-min window is considered: 59% of the home-work and 45% of the work-home trips occurred within 15 min of the average departure time, indicating a pattern of regularity of departure times in commuting trips. Furthermore, 90% and 78% of the trips occurred within 60 min of the average departure times for home-work and work-home trips, respectively. The density curves in Fig. 2 reveal a bell shape for the home-work distribution, whereas the curve for work-home trips is flatter, indicating that the differences in departure times for the work-home trips are more spread out, with higher variability.
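The window shares and the five-minute histogram bins discussed above can be derived from a list of deviations with a few lines of code. This is a sketch only: whether the window boundaries are inclusive is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def share_within(deviations, window_min):
    """Share of trips whose absolute deviation is at most window_min minutes."""
    if not deviations:
        return 0.0
    hits = sum(1 for d in deviations if abs(d) <= window_min)
    return hits / len(deviations)


def histogram_5min(deviations, bin_width=5):
    """Count deviations per bin; each key is the lower edge of a bin in minutes."""
    return Counter(math.floor(d / bin_width) * bin_width for d in deviations)
```

For example, `share_within(devs, 15)` on the home-work deviations would correspond to the 15-min window reported in the text, provided the same boundary convention is used.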

Recurrence in Mode/Line Choice
This subsection proposes an analysis of recurrence in mode/line choice. For private trips, the mode is taken to be 'private' and no other distinction (e.g. bike or car) is made. For mixed and PT trips, the analysis goes beyond the mode (private, train, bus, and tram) and also assesses all lines. Figure 3 shows the mode share analysis.
Figure 3 confirms the information displayed in Table 1, and visual inspection shows that the majority of users change their mode preference when all trips are considered vs. when only commuting trips are considered. While private and mixed profiles seem to dominate over PT trips when all trips and non-commuting trips are considered (first two bars from left to right), this pattern changes to PT dominance when only commuting trips are considered (last three bars from left to right). In particular, non-commuting trips are dominated by three modes: private (40.3%), tram only (22.2%) and bus only (12.7%). On the other hand, commuting trips are characterized by a lower presence of private mode (23.6%) and tram only (18.4%), but present an increased share of bus only (15.9%) and bus + tram (15.6%). In terms of differences between home-work and work-home trips (last two bars on the right side), the private share does not change significantly, although private + bus increases by about 3%, possibly indicating owners of bikes who decide to take the bus back home. In terms of PT, there is a small change in the share of bus trips (an increase of about 2% in the work-home category) and of bus + tram trips (a decrease of about 5% in the work-home category). The willingness to transfer in PT when commuting is investigated next.
Assessing the frequency and regularity of a line choice reveals important characteristics of commuters. In fact, in PT the user is bound to the set of services serving the area close to home and work, and to routes in the network which are shaped by frequency and line planning. Moreover, transfers are known to be particularly important in the route choice of travellers. It is, therefore, interesting to see how many lines are typically used by users, both in terms of the variety possible and of the variety actually used.
In this analysis, we consider lines sharing stops and a part of their route as different lines. We make this assumption since it is not possible to know if a passenger considers them as the same or different options. In fact, with each different line, a destination can be reached in different ways (e.g. from different stops, walking, with additional transfers). Moreover, in the city under study, there are rarely fully overlapping lines, but rather lines overlapping partially, and only for a few stops.
We start by contrasting the average number of lines per trip (a proxy for the number of transfers, which is one less than the number of lines used), and the average unique number of lines per traveller. For non-commuting trips, each trip had an average of 1.39 lines, and each traveller used an average of 7.01 unique line numbers. For commuting trips, the average number of lines per trip was higher, 1.88, although the average unique number of lines per user was much lower, only 4.03 (home-work trips 1.89 and 3.14, work-home trips 1.85 and 3.17, for the average number of lines per trip and average unique number of lines per user, respectively). This reveals that, although travellers used, on average, more lines per trip (i.e. more transfers) when on commuting trips, their choice of lines was restricted to a lower number of lines. On top of that, most non-commuting trips (68%) were made with only one line, whereas for commuting trips this percentage was 37% (similar percentages were obtained for home-work and work-home trips, 34% and 38%, respectively). This means that 63% of commuting trips (66% and 62% for home-work and work-home trips, respectively) had at least one line transfer.
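The three quantities contrasted above (average lines per trip, average unique lines per traveller, and share of trips with at least one transfer) can be computed as in the following sketch. The input layout, a list of (user, ordered list of line identifiers) pairs for PT trips, is an assumption made for illustration.

```python
def line_statistics(trips):
    """Summarize line usage over a set of PT trips.

    trips: list of (user, lines) pairs, where lines is the ordered list of
    line identifiers used on that trip (assumed layout, at least one trip).
    Returns (average lines per trip, average unique lines per traveller,
    share of trips with at least one line transfer).
    """
    n_trips = len(trips)
    avg_lines_per_trip = sum(len(lines) for _, lines in trips) / n_trips

    unique_lines = {}
    for user, lines in trips:
        unique_lines.setdefault(user, set()).update(lines)
    avg_unique_per_user = sum(len(s) for s in unique_lines.values()) / len(unique_lines)

    # A trip using more than one line involves at least one transfer.
    share_with_transfer = sum(1 for _, lines in trips if len(lines) > 1) / n_trips
    return avg_lines_per_trip, avg_unique_per_user, share_with_transfer
```

Running the function separately on commuting and non-commuting subsets would yield the kind of contrast reported in the text.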
The slight discrepancy between the average unique number of lines per traveller in commuting trips and in the subcategories home-work and work-home calls for an investigation of whether different lines are used in the two directions. In fact, for travellers having at least one home-work and one work-home trip, an agreement of (all) lines between both ways was obtained in only about 50% of the cases. In other words, although it is possible to take the same line both ways, only about half of the travellers had at least one home-work trip and one work-home trip using exactly the same line(s). For the other half of passengers in commuting trips, the lines for home-work and work-home trips never matched. However, this could also be a result of a network that offers multiple possibilities of lines serving common stops, as in the case of the Zürich network.
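The agreement check between the two directions can be sketched as below. The sketch assumes that a trip is described by the set of lines used, so that the reversed order of lines on the return trip still counts as a match; this reading of "exactly the same line(s)", like the input layout and the function name, is an assumption.

```python
DIRECTIONS = ("home-work", "work-home")


def share_with_matching_lines(trips):
    """Share of travellers with at least one line-set match between directions.

    trips: iterable of (user, direction, lines) tuples (assumed layout).
    Only users with at least one trip in each direction are considered.
    Returns None when no such user exists.
    """
    by_user = {}
    for user, direction, lines in trips:
        if direction not in DIRECTIONS:
            continue  # ignore non-commuting trips
        per_dir = by_user.setdefault(user, {d: set() for d in DIRECTIONS})
        per_dir[direction].add(frozenset(lines))

    eligible = [p for p in by_user.values() if all(p[d] for d in DIRECTIONS)]
    if not eligible:
        return None
    matching = sum(1 for p in eligible if p["home-work"] & p["work-home"])
    return matching / len(eligible)
```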
The frequency of the most observed line per traveller, or the "preferred" line, is useful to understand how often each individual used their preferred line. This frequency is obtained for each trip type and displayed in Fig. 4, where each circle represents the count of users with the same frequency for their preferred (PT) line in a given trip category. The size of the circle indicates the count, and it ranges from 1 (the smallest circle in each category) to 20 (the big orange circle in the home-work category). Hence, for instance, if the frequency is 1.0, then all trips are made with the same line (also implying the same mode). Circles at other frequencies, especially 0.5 and 0.34, can be associated with, at least, one or two transfers, respectively. If two or more lines have exactly the same frequency, one line is picked randomly and plotted; this is the case, for example, of a traveller who always picks the same route consisting of two lines, so that both lines are preferred lines with a frequency of 0.5 each. For commuting trips, if, for each user, the difference between the frequency associated with the preferred mode (similar analysis, but not displayed) and the frequency associated with the preferred line is considered, the average value (across all users) of this difference is 0.30 for home-work trips (standard deviation of 0.25) and 0.33 for work-home trips (standard deviation of 0.25).
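A possible reading of the preferred-line frequency is the share of a traveller's line uses that fall on their most used line, which reproduces the 1.0 and 0.5 cases mentioned above. The sketch below follows that reading, which is an assumption; it also breaks ties by taking the line encountered first rather than picking one at random.

```python
from collections import Counter


def preferred_line_frequency(trips_lines):
    """Most used line of one traveller and its share of all line uses.

    trips_lines: list of trips, each a list of line identifiers (assumed layout).
    Ties are resolved in favour of the line seen first (the text picks randomly).
    """
    counts = Counter(line for lines in trips_lines for line in lines)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no line uses recorded for this traveller")
    line, count = counts.most_common(1)[0]
    return line, count / total
```

A traveller who always takes the same two-line route gets a frequency of 0.5 under this definition, while a traveller who always takes one single line gets 1.0.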
Taking the frequency of use of each traveller's preferred line, about 20% of all commuting trips and about 19% of work-home trips are made with a single (and preferred) line, contrasting with the 37% and 38%, respectively, obtained when all lines (and not only the preferred ones) are considered. Home-work trips are made with one preferred line for about 28% of the travellers, contrasting with 34% when all the single lines for the same OD pair are considered. For these cases, no transfers are made, and the travellers show a clear preference for one particular line choice, especially when commuting from home to work (vs. the opposite direction). For non-commuting trips, as expected, this percentage is much lower (about 7%), although the main reason lies in the fact that non-commuting trips incorporate trips to all OD pairs not classified as home or work, so the concentration of circles at small frequencies is not directly related to users opting for multiple lines in one journey. In general, the big circles around the frequency of 1.0 indicate that many users stick to one (preferred) line option, although the concentration of several circles below the frequency of 0.5 indicates that many travellers have multiple lines (which could be transfers in a common route) tying in first place in terms of use frequencies.

Recurrence in Trip Duration
Trip duration, including the duration of walks, transfers and modes, is a key element to understand the regularity of commuting trips. First, the recurrence of duration for the same OD pair indicates that the commuters choose routes that fit their expectation (or limit) for arrival time at their destination. Second, when the use of the same line for both ways is possible, investigating differences in the duration between the two OD pairs (home-work and work-home) may reveal travellers' behavioural characteristics linked to comfort, willingness to transfer and to walk. For example, if the duration of the home-work trips is significantly shorter than that of the work-home trips, and that is due to an extra transfer in the first category and a prolonged walk in the second, then the traveller's behaviour can be assumed to change from one OD pair to the opposite one (which of course has a different trip purpose).
In an idealized scenario, which does not usually correspond to real networks, since the locations of home and work are fixed, one would expect the trip duration for both ways to be about the same. Of course, external factors, such as traffic, crowding and weather, could have an impact on the route, e.g., the traveller decides to walk instead of taking the second transfer in PT because of crowding during evening rush hour. The proposed investigation does not consider the influence of such external factors; instead, the focus is on the regularity of the route choice. In this sense, the factors that have a recurrent impact (and are not only isolated or seldom observed cases, like network disruptions) are assumed to be known by the traveller, so that the observed daily route is a result of the traveller's behaviour for that type of trip and prior knowledge of the internal (timetable, route, headways, etc.) and external (usual traffic and crowding conditions) factors. This should not be taken as a conservative assumption, since it is fairly general for travellers to know the routes that best accommodate them when on commuting trips, as shown by many previous works, e.g. Goulet-Langlois et al. (2016); Kusakabe and Asakura (2014); Ma et al. (2013); Ortega-Tong (2013); Zhou et al. (2014).
Figure 5 shows a comparison across the commuting trips (home-work and work-home) of walking, transfer, and travel time (further divided into bus, train, tram and private times), all given in minutes. The averages are labelled for each trip category and identified by the red triangles inside the box plots. On average, work-home trips take about 1.5 min longer than home-work trips, with longer average walking times and transfer times (10.1 and 4.4 min vs. 8.8 and 4.0 min, respectively). Similar average times are obtained when summing over the three PT modes (10.2 and 11.5 min, respectively), and when considering private modes (5.7 and 5.4 min, respectively), indicating that, on average, the users are consistent in terms of time spent on the chosen mode of transport. Trains show shorter times, owing to their higher average speeds. Also, because of their smaller network, they are used less by the average traveller.
Looking into the box plots in detail, although differences are small, walk times, tram times and transfer times have the position of the median and/or third quartile more noticeably shifted between the two panels. In fact, the median total trip times are 27.9 and 28.4 min for home-work and work-home trips, respectively, with small differences in the median walking times (6.7 and 7.2 min, respectively) and transfer times (1.8 and 0.0 min, respectively, the latter reflecting a higher presence of trips without transfers). For the 75th percentile, the differences are more pronounced: the total time at this percentile is 35.0 min for home-work trips and 39.3 min for work-home trips, a 4.3 min difference. Significant differences arise in walking times (11.0 min vs. 13.7 min, respectively) and in transfer times (5.9 min vs. 7.3 min, respectively). The analysis reveals that median times of commuting trips, whether home-work or work-home, are about the same. However, for the upper half of the trips, the differences in the times between the two categories become more evident, with work-home trips taking longer than home-work trips, even if outliers are not considered. In practical terms, trips with longer duration happened more often in the work-home direction, with the differences arising mostly from longer walking and transfer times.
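The comparison of means, medians and third quartiles between the two directions can be reproduced for any time component (total, walking, transfer) with the standard library, as in the sketch below. The quantile interpolation rule used for the box plots is not stated, so the 'inclusive' method is an assumption, as are the function names.

```python
import statistics


def summarize(times_min):
    """Mean, median and 75th percentile of a list of times in minutes."""
    q1, q2, q3 = statistics.quantiles(times_min, n=4, method="inclusive")
    return {"mean": statistics.fmean(times_min), "median": q2, "q3": q3}


def direction_gap(home_work_min, work_home_min):
    """Work-home minus home-work difference for each summary statistic."""
    hw, wh = summarize(home_work_min), summarize(work_home_min)
    return {key: wh[key] - hw[key] for key in hw}
```

A gap that is close to zero at the median but clearly positive at the third quartile would match the pattern described in the text, where the difference between directions concentrates in the longer trips.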

Frequency of Similar Trips
The previous subsections considered recurrence from the perspective of isolated factors, namely departure times, mode/line choice, and trip duration. These types of analyses allow not only understanding the relative importance and main statistics of each factor in each type of trip, but also making comparisons, when applicable, with non-commuting trip types. They reveal interpersonal variability. For commuting trips, considering all these factors together reveals the proportion of these trips that can be assumed equivalent, and/or which level of similarity is realized for each user (intrapersonal variability). In other words, it reveals how much each user "sticks" to their usual travel choices. For this analysis, the grouping heuristic described in Algorithm 1 is applied and the results are discussed as follows.
Figure 6 shows the results for three metrics after Algorithm 1 is applied to the dataset. In Fig. 6, each bar represents one user and users are sorted in decreasing order of the similarity metric. Users with only one trip as a result of Algorithm 1 are neither plotted nor further analysed. Hence, the top panel shows the 73 users who had at least two home-work trips grouped according to the similarity metric in Algorithm 1, and the bottom panel shows the 64 users who had at least two work-home trips also grouped according to the same metric. Figure 6 shows, for each user, the total number of trips (orange bar), the average number of trips across all clusters (blue bar), and the size of the largest cluster (maximum number of trips grouped together, red circle). It is important to emphasize that the bars are overlaid, not stacked. Hence, the orange bar values are the same as the blue bar values when they are not visible (e.g. for the first user in the top panel, the orange bar is not visible, hence, its value is the same as the blue bar, or 14).
In Fig. 6, the similarity metric corresponds to the size of the largest cluster (red circle) over the total number of trips (orange bar) or, in other words, to the relative position of the red circle with respect to the height of the orange bar. This metric equals one (or 100%) when the orange bars are not visible (the first users from left to right), and for these users, the red circle matches both the total number of trips (orange bar) and the average number of trips on clusters (blue bar). Table 2 summarizes the statistics for the similarity metric (ratio between the size of the largest cluster and total number of trips) for users on home-work and work-home trips.
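The three quantities plotted in Fig. 6 can be illustrated with a small sketch. The grouping rule below is not Algorithm 1: it is a hypothetical greedy heuristic that groups trips using the same set of lines whose departure time and duration lie within assumed tolerances (15 and 10 min, both illustrative values) of the first trip in the group. Only the final ratio, largest cluster size over total number of trips, follows the definition given above.

```python
def group_similar_trips(trips, dep_tol=15.0, dur_tol=10.0):
    """Greedy grouping of one user's trips in one direction (illustrative only).

    trips: list of (lines, departure_minute, duration_min) tuples (assumed layout).
    dep_tol and dur_tol are hypothetical tolerances in minutes.
    """
    clusters = []  # each cluster: {"ref": (line set, departure, duration), "members": [...]}
    for lines, dep, dur in trips:
        key = frozenset(lines)
        for cluster in clusters:
            ref_key, ref_dep, ref_dur = cluster["ref"]
            if (key == ref_key and abs(dep - ref_dep) <= dep_tol
                    and abs(dur - ref_dur) <= dur_tol):
                cluster["members"].append((lines, dep, dur))
                break
        else:
            # No existing cluster is close enough: this trip starts a new one.
            clusters.append({"ref": (key, dep, dur), "members": [(lines, dep, dur)]})
    return [c["members"] for c in clusters]


def cluster_summary(clusters):
    """Total trips, average cluster size, largest cluster and similarity metric."""
    if not clusters:
        raise ValueError("at least one cluster is required")
    total = sum(len(c) for c in clusters)
    largest = max(len(c) for c in clusters)
    return {
        "total_trips": total,                  # orange bar
        "avg_cluster_size": total / len(clusters),  # blue bar
        "largest_cluster": largest,            # red circle
        "similarity": largest / total,         # largest cluster over total trips
    }
```

With four trips of which three fall into one group, the sketch returns a similarity of 0.75, i.e. 3 out of 4 trips belong to the most frequent cluster.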
Out of the 101 travellers with recorded home-work trips, 73 had at least two trips grouped together according to the similarity metric. Each bar in the top panel of Fig. 6 corresponds to one such traveller, with an average of 7.87 trips per traveller. Of the 96 travellers with recorded work-home trips, 64 had at least two trips (bottom panel), with an average of 6.0 trips per traveller. Table 2 shows that many travellers opt for the same routes and close schedules when choosing their commuting trips. In particular, both mean and median values for the similarity metric were high (means of 73.0% and 76.0%, medians of 76.2% and 75.0% for home-work and work-home trips, respectively). Furthermore, the relatively high position of the blue bars (average number of trips on clusters) compared to the orange bars (total number of trips) indicates that many users stick to a few preferred routes (clusters) and recurrent travel choices when commuting between home and work. For users where the orange bars are not visible (the first ones from left to right), 100% of their commuting trips were made with the exact same line(s), and similar duration and departure times. Furthermore, median values of at least 75% indicate that, for half of the users in home-work and work-home trips, at least 3 out of 4 of their commuting trips belong to their most frequent cluster.
From the analyses made, when factors are considered in isolation, important interpersonal differences between home-work and work-home trips can be inferred, especially in terms of average departure times, different line choices depending on direction, and trip duration (with prominent differences coming mainly from transfer and walking times). However, when considering a smaller subset of those trips based on the similarity metric for each user and for each direction of the journey, the analysis revealed that a significant part of commuters are consistent in their route and time choices for both home-work and work-home trips, and although the characteristics of the two directions may differ (e.g. different line choices and departure times), most trips in each category follow a default choice, indicating small intrapersonal variability.

Conclusions and Future Research
This paper explores recurrence and repeated patterns in public transport mobility, focusing on commuting trips. We develop our analysis on rich longitudinal data on commuting behaviour, based on travel diaries collected by a smartphone application called ETH-IVT Travel Diary, consisting of 6838 PT and private trips of 172 users in the city of Zürich (Switzerland). The framework uses unlabelled GPS tracking data as input, building up from the selection of an adequate unsupervised clustering technique to studying behavioural aspects related to commuting trips and proposing a similarity metric to investigate the level of similarity in home-work and work-home trips for each user. These analyses consider all trips from home to work, and vice versa, to find spatial and temporal regularities in terms of mode/line, departure times and trip duration. When applicable, the results are also contrasted with those of non-commuting trips. Then, a heuristic algorithm to cluster the commuting trips into sets of similar trips is also proposed and provides a means of measuring how consistent each user is with their default route choice for commuting trips.
The main findings in the paper can be summarized as follows: (1) public transport commuters have regular departure times, and the majority of home-work trips and a significant part of work-home trips are realized within 15 min of the average departure time (59% and 45%, respectively); (2) while private and mixed profiles seem to dominate over PT trips when non-commuting trips are considered (private corresponds to 40.4% and mixed 13.6%), this pattern changes to PT dominance when only commuting trips are considered (tram 18.4%, bus 15.9% and bus + tram 15.6%); (3) line transfer is frequent for PT commuters, with 1.88 lines used per trip and the majority of PT trips (63%) having at least one line transfer, whereas non-commuting PT trips are characterized by a lower average number of lines per trip (1.39) and a majority of trips without transfers (68%); (4) although the network allows taking the same lines in both directions (home-work or work-home), only for about 50% of the PT commuters was there at least one trip with an agreement between the lines used in both directions; (5) important differences between the duration of home-work and work-home trips arise in transfer and walking times and, although the median values are similar, the analysis of the upper quartiles reveals a difference of 4.3 min in total time (work-home trips taking longer), with 2.7 min of difference in walking times and 1.4 min in transfer times, both longer for work-home trips; (6) a significant part of the commuters are consistent in their route choices in terms of lines, departure times and trip duration, with the similarity metric demonstrating that for half of the commuters, 3 out of 4 home-work or work-home trips can be considered "equivalent" from a route choice perspective; more specifically, the proportion of similar trips among the home-work trips has a mean of 73%, while the work-home proportion has a mean of 76%.
In general, these results validate the regularity of commuters' behaviour in both private and PT trips. The analyses conducted yield three pivotal conclusions that hold significant potential for assisting policymakers in delivering better services. The first conclusion pertains to contrasting commuting trips with non-commuting trips and the associated changes in modes and travel preferences. In particular, the majority of users prefer public transport options when travelling between their homes and workplaces. Remarkably, a significant proportion of these commuters demonstrate a willingness to transfer more and utilise mixed modes during their commuting journeys. This preference is particularly evident when examining Fig. 1, which illustrates that commuters with a higher proportion of commuting trips tend to reside and work in close proximity to the city's main station, benefiting from exceptional connectivity. Additionally, the proportion of commuters using PT decreases noticeably for work locations situated farther away. These observations strongly suggest that users are inclined to choose PT for their commutes when accessibility and reliability are adequately provided. In contrast to commuting trips, the mode share for non-commuting trips predominantly favours private modes of transportation. This finding presents an opportunity for further investigation into the underlying reasons for this behaviour. Understanding the motivations behind the preference for private transportation in non-commuting contexts could facilitate the identification of necessary changes to attract more users to PT modes for such trips.
The second main conclusion highlights the disparities observed between home-work and work-home trips, shedding light on important distinctions that can guide policymakers in refining their service offerings. In terms of private commuting, no significant differences in vehicle travel times were observed when contrasting home-work and work-home trips, indicating that both directions are subject to similar traffic conditions. In terms of PT travel times, differences in walking times and transfer times indicate that the experience of commuters may differ depending on the direction of their journey, with work-home trips exhibiting greater walking and transfer times and higher variability of departure times. Additionally, work-home trips were found to be less frequent than home-work trips. Moreover, for PT trips, there was limited agreement between the lines used for home-work and work-home journeys, suggesting a higher level of complexity in trip planning. These findings highlight the complexity of the choice process, especially in work-home trips, which often involve intricate trip chaining and, therefore, less predictability, making their forecast and information provision more complex for service providers. Notably, the study did not include analysis of chained trips, indicating the need for future research to explore the effects of trip chaining and develop suitable metrics to identify regular patterns of behaviour in these trips.
Finally, the third conclusion relates to the regularity of travel behaviour in terms of the three main metrics studied: departure times, line/mode choice and trip duration. The analysis reveals that users exhibit remarkable consistency in their route choices when differentiating between home-work and work-home trips. The proposed similarity metric demonstrates that, for half of the users, at least 3 out of 4 trips can be considered equivalent from a route choice perspective. This consistency underscores the existence of habitual travel behaviour, where individuals tend to adopt preferred and familiar routes for their regular commuting journeys. Understanding and leveraging this regularity can assist policymakers and service providers in several ways, for example, by allocating resources efficiently, optimizing schedules, and enhancing the overall travel experience for most users. Furthermore, these conclusions highlight the importance of ensuring accessibility, reliability, and convenience in PT options for commuting journeys, and they emphasize the need for policymakers to explore strategies that incentivize the use of PT modes in non-commuting scenarios to achieve a more balanced and sustainable transportation system.
This research path could be extended in many ways. Upon a subset of similar trips, where the traveller's behaviour can be inferred, it is useful to contrast the recurrent path choice of the traveller with the alternatives available to investigate, for example, the traveller's willingness to walk, make transfers, and even the behaviour in case of disruptions. To do this, the unchosen alternatives (route choice set) need to be identified. By analysing the differences, considering the planned or the actual timetable (with delays), one can understand the role of delays and online information in the choice process of users. This is especially true for commuters, who have a combination of good knowledge of the service and exposure to regular non-performance, and thus might have very interesting patterns of dealing with recurrent and non-recurrent disruptions.
Another idea for future research arises from a key change in mobility patterns resulting from more flexible work environments, which include, for example, more days spent in home office. Following this path, this research could be extended by looking not only at home and work OD pairs but also at all different OD pairs which represent regularly visited locations. This could be used, as in Vander Laan et al. (2021), to gain insights into the distribution of trip origins/destinations and investigate OD pairs of traffic travelling through congested corridors. Fusion with other sources of data, such as sociodemographic and vehicle registration data, would allow inferences to be made about the household income, age, race, etc., of travellers and the vehicle types that use the congested segments. Lastly, a more comprehensive statistical modelling and understanding of features related to PT mobility can be derived, based on the understanding of the underlying stochasticity (Blume et al. 2022).

Fig. 1 Proportion of commuting according to home and work locations.Left panel: home locations.Right panel: work locations

Fig. 2 Trip counts of deviation in minutes from average departure time, stratified by home-work and work-home trips

Fig. 4 Counts of frequencies of the preferred (PT) line per trip category

Fig. 5 Box plots: breakdown of trip times (home-work vs. work-home)

Fig. 6 Similarity metrics for users with home-work and work-home trips (orange and blue bars are overlaid)

Table 1 Summary of activities according to chosen clustering strategy

Table 2 Statistics for the proportion of similar trips across all users in commuting trips