Introduction

The acquisition and dissemination of individual data are key for research in many disciplines, including epidemiology, social simulation, economics, and engineering. Wide access to individual data has unprecedented benefits for the analysis and modelling of heterogeneous individual behaviour and is highly valuable when supporting decision-making. For instance, during the COVID-19 pandemic, individual data could be used to develop contact tracing and social distancing strategies to limit the spread of the disease and to recover from the pandemic.

However, governments and individuals across the globe have become increasingly concerned about privacy and the exchange of personal data. Data owners, such as statistical and operational agencies that disseminate the data, are often ethically and legally required to protect the personally identifiable information of individuals. Agencies often implement simple data de-identification measures, such as removing obvious identifiers like names and addresses. In the case of Smart Card data in public transport (the data of the payment cards that often store the tag-on and tag-off locations and times of individual transit riders), data owners often perform a ‘data masking’ measure, e.g., replacing the original Smart Card ID with a unique hashed identification number. Simple data de-identification measures, such as data masking, top-coding, adding noise, or random data swapping, are not sufficient to protect individual confidentiality (Drechsler and Reiter 2011). In the case of Smart Card data, intruders may observe the same Smart Card user over a period of multiple days and locate an individual’s home location and personal travel patterns and trajectories. This poses a major risk of privacy breaches for vulnerable individuals, such as senior or child cardholders, members of ethnic or religious minorities at high risk of hate crimes, and women and members of the LGBTQ+ community. This risk thus prevents the broader dissemination of personal data such as Smart Card data to the research community and limits the impact of research on policy making. On the other hand, as the risks of personal data disclosure increase, the heavier alterations that data owners apply through simple de-identification techniques may degrade the usefulness of the released data.

To address the limitations of standard de-identification measures, the literature offers various approaches for generating partially or fully synthetic data from real data. The idea is to analyse and model the real data so as to preserve the ability to query the data without revealing raw individual data points. In other words, data synthesis aims to retain the distributions in the data, e.g., the mean, standard deviation, and potentially even the probability distribution of the data, while ensuring that no data sample represents a real person in the raw data. Synthetic data enable public dissemination of the data while protecting individual privacy and preserving data utility. With higher quality synthetic data, analysts can develop meaningful and relevant research that can contribute to decision-making. Data owners, who are generally policy makers, can also benefit from access to cutting-edge models and synthesis methods that can then be implemented directly on the real data.

While synthetic data generation has attracted great interest and proved effective for images (Karras et al. 2020), music (Briot et al. 2020), and texts (McKeown 1992), existing studies on synthetic transport data are still limited. Most existing works focus on the creation of synthetic populations for activity-based travel models (Axhausen and Gärling 1992), for which various methods such as Iterative Proportional Fitting (Ruschendorf 1995), Maximum Cross-Entropy (Guo Jessica and Bhat Chandra 2007), and Bayesian Networks (Sun and Erath 2015) have been developed. More recent studies have focused on utilising more data-driven methods to discover and model hidden mobility patterns, such as Input–Output Hidden Markov models (Yin et al. 2017), semi-Markov models (Samiul and Ukkusuri Satish 2017), and Generative Adversarial Networks (Badu-Marfo et al. 2020).

The main objective of this paper diverges from existing research efforts as it seeks to generate a synthetic version of a large transport dataset that preserves the format and spatio-temporal mobility patterns of individuals, while ensuring that each data point does not represent any individual person. Specifically, we focus on Smart Card data, which has become the prevalent standard in modern public transport systems. Although the availability of Smart Card data has opened up new avenues for research in intelligent transport systems, such as the analysis of travel behaviours (Kieu et al. 2015), identification of trip purposes (Lee and Hickman 2014), and transfer intentions (Kieu et al. 2017), its potential has not been fully realised by the research community, while policy makers who have access to the raw data remain largely unaware of cutting-edge research possibilities. This paper proposes a framework that connects Smart Card data owners to a broader research community through the creation of synthetic data. Additionally, this paper offers a superior option for public data dissemination to public transport agencies, research centres, local councils, and other Smart Card data owners.

This paper addresses three key scientific challenges related to Smart Card data: non-Gaussian distributions, mixed data types, and highly imbalanced and multi-modal data. Real-world Smart Card data are often non-Gaussian, and may include both continuous values (such as tag-on and tag-off timestamps) and discrete values (such as Route, Direction, or Zones). Additionally, the data are frequently highly imbalanced, with certain discrete fields such as Zones or Routes being much more popular than others, and the continuous fields such as tag-on and tag-off likely exhibiting bi-modal distributions reflecting the two peak periods. The scientific contributions of this paper are twofold:

  • We apply a Generative Adversarial Network and a Bayesian Network to model and generate synthetic Smart Card data

  • We compare and contrast the two methods mentioned above, discussing the advantages and disadvantages of each for the data synthesis problem.

The remainder of this manuscript is structured as follows. Section “Related Works” reviews the literature on synthetic data generation for sequential data. Section “Methodology” discusses the synthesis methods applied in this paper and compares and contrasts them in the context of synthetic data generation. Section “Numerical Experiments” discusses the application of these synthesis methods to Smart Card data from the public transport network of South East Queensland, Australia and compares both the modelling process and the synthesised data from the methods. Finally, Sect. “Discussion and Conclusion” concludes this study and suggests several directions for future research.

Related Works

Classical Data De-identification Techniques

Classical data de-identification techniques aim to create a version of the real data that contains no, or only altered, information about individuals. Common approaches alter the raw data in ways ranging from simply removing sensitive attributes (e.g., ID, name, age, and income) to aggregating geography, swapping data across records, and adding noise to the data (Willenborg and Waal 2001). Mendes and Vilela (2017) provide a survey of the most common privacy-preserving techniques, which can be classified into four main classes:

  • Generalisation: replacement of values with more general ones, e.g., numerical values being generalised as an interval

  • Suppression: removal of sensitive attributes, e.g., name or ID

  • Anatomisation: de-association of sensitive attributes by splitting them across multiple tables, making re-identification more difficult

  • Perturbation: replacement of the raw data with synthetic values that carry the same statistical information, or swapping of data between attributes.

However, recent research examples show that these classical anonymisation techniques fail for ‘big data’ (Lu et al. 2014; Shrivastva et al. 2014) for two reasons. First, they often protect privacy by reducing the utility of the data (Purdam and Elliot 2007). Second, even the strictest classical data anonymisation techniques may fail to preserve confidentiality from a nefarious person who is reasonably competent in employing investigative techniques (Paul 2009). Sweeney (2001) showed that 97% of the names and addresses on a voting list can be identified using only ZIP codes and date of birth.

The Generation of Synthetic Transport Data

To address the shortcomings of classical data anonymisation techniques, generative models aim to learn the probability distributions in the original data and then generate completely synthetic versions of it. Unlike classical data anonymisation techniques, which mainly focus on releasing altered versions of the raw data, generative models draw new, synthetic samples that retain the distributions and correlations of the real data but do not contain any record of real individuals.

Synthetic data should, in general, have the same format and statistical distribution as the real data. This is to ensure that data users can derive the same insights from the synthetic data as from the real data. Moreover, to maximise their benefits, synthetic data should be usable in place of the real data in data-driven models; in other words, models trained on the synthetic data should be directly applicable to the real data, allowing data owners to deploy such models on the real data. To this end, there is a vast and recent literature on the use of generative deep learning methods, such as Generative Adversarial Networks (GAN) (Goodfellow et al. 2014), generative stochastic networks (Bengio et al. 2014), Bayesian Networks (Deeva et al. 2020), Hidden Markov Models (Ismaïl et al. 2016), and variational autoencoders (VAEs) (Kingma and Welling 2014). In recent years, these methods have been extended to cater for specific data types and have achieved notable successes with images (Karras et al. 2020), music (Briot et al. 2020), texts (McKeown 1992), and tabular data (Xu et al. 2019). However, there are generally far fewer combinations of letters and musical notes than of spatial locations and timestamps in travel activities.

The generation of synthetic travel activities is related to a major class of transport models, activity-based models (Rasouli and Timmermans 2014; Axhausen and Gärling 1992), as these also aim to reproduce realistic individual travel activities. There is a rich literature on synthetic populations for activity-based travel models, as these models often require detailed data on the locations and timestamps of individual trips. Classical methods in this area include deterministic approaches such as Iterative Proportional Fitting (Ruschendorf 1995) and Maximum Cross-Entropy (Guo Jessica and Bhat Chandra 2007), as well as probabilistic methods such as Bayesian Networks (Sun and Erath 2015).

Recent studies have proposed more data-driven methods that can learn directly from data and produce synthetic samples with a similar data format and statistical distributions. To this end, several Markov-based studies model human travel activities as probabilistic and stochastic changes of discrete states. Samiul and Ukkusuri Satish (2017) aim to reconstruct activity–location sequences of travel activities from online social media data with incomplete information; a semi-Markov modelling approach was proposed, along with a particle-based Markov chain Monte Carlo sampler to perform parameter inference for the model. Jiang et al. (2016) proposed TimeGeo, a mechanistic modelling framework that aims to synthetically generate travel activities at high resolution. TimeGeo integrates a Markov chain for generating temporal travel patterns with a rank-based exploration and preferential return (r-EPR) mechanism for generating spatial patterns. Yin et al. (2017) propose an Input–Output Hidden Markov model that utilises contextual information to address the limitations of standard Hidden Markov models caused by their uniform transition and emission probabilities. Also based on a Markov chain, Pappalardo and Simini (2018) extend the work of Jiang et al. (2016) by proposing a mechanistic model that is more data-driven and parameter-free. The model in Pappalardo and Simini (2018) aims to generate individual trajectories at fine spatio-temporal resolution: it learns a mobility diary from the data and then generates a mobility trajectory based on the concepts of preferential exploration and preferential return. Although mechanistic models, such as those of Pappalardo and Simini (2018), Jiang et al. (2016) and Yin et al. (2017), are often interpretable, they may lack the capacity to model the complex mobility patterns of a large number of individuals (Choi et al. 2021). For this reason, deep-learning-based generative mobility modelling has recently become the emerging approach in the literature. Badu-Marfo et al. (2020) developed a Generative Adversarial Network (GAN) to synthesise daily travel activities; the GAN model was trained in a differentially private manner to guarantee the privacy of the synthesis seed sample members. Choi et al. (2021) proposed a generative adversarial imitation learning framework based on GAN to generate trajectory data in urban transport, in which individual decisions on the road network were modelled as a partially observable Markov decision process.

The Generation of Synthetic Smart Card Data

This paper aims to create a synthetic version of a large transport dataset that retains the format and spatio-temporal mobility patterns of individuals, while guaranteeing that no data point can be traced back to a real individual. The case study is the generation of synthetic Smart Card data in public transport. Smart Card data have recently been utilised for various purposes, but the data have only been accessible to a very limited number of researchers. If a wider research community could access the (synthetic) Smart Card data, we might see many more innovations to support policymakers in making informed decisions in public transport. For instance, exploratory data analysis or simulation models can be developed on the synthetic dataset and then applied to the real Smart Card dataset by data owners.

To this end, Bouman et al. (2017) proposed an approach for the generation of synthetic Smart Card data. However, the proposed Markov Chain was not calibrated against real data, making it impossible to evaluate the similarity of the synthetic data to the real data. In this paper, we focus on a more data-driven approach to generate synthetic Smart Card data that are as similar as possible to real data. To this end, we identify several challenges in the development of Smart Card data synthesisers:

  • Non-Gaussian distributions: the real-world Smart Card data are unlikely to be Gaussian, which may lead to the vanishing gradient problem during data normalisation (Xu et al. 2019)

  • Mixed data types: Smart Card data often consist of both continuous values (e.g., tag-on and tag-off timestamps) and discrete values (e.g., Route, Direction, or Zones).

  • Highly imbalanced and multi-modal data: Smart Card data are often highly imbalanced, especially on some discrete fields, such as Zones or Routes, as some are more popular than others. The continuous fields such as tag-on and tag-off are likely bi-modal, reflecting the two peak periods.

Methodology

Data Description

This paper uses real, individual Smart Card data transactions from Brisbane, Australia as the case study to show the effectiveness of the proposed methods. Smart Cards have been widely used as public transport fare cards around the world; a version is now available in many major cities, such as the Oyster Card in London (UK), the Opal Card in Sydney (Australia), or the Hop Card in Auckland (New Zealand). This paper uses raw Smart Card data from a 3-month period of bus trips associated with the South East Busways (Translink 2019). The data consist of over 1 million Smart Card transactions from July to October 2015. The dataset is provided by Translink, the transit authority of South East Queensland, Australia. We have processed the data to remove fields unnecessary for this project and to remove the individual Smart Card identifier. Each processed data record contains the following fields:

  • Trip ID: Unique identifier for each bus trip

  • Route: Bus route (e.g., 555)

  • Direction: Direction of travel (Inbound or Outbound)

  • Origin zone: The zone ID of trip origin (1–23)

  • Destination zone: The zone ID of trip destination (1–23)

  • Month: Month of the trip (July–Oct)

  • Weekday: Monday–Friday

  • Tag-on: The time a Smart Card user enters the bus (e.g., 08:23 AM)

  • Tag-off: The time the Smart Card user leaves the bus (e.g., 08:47 AM).

The Origin and Destination (O&D) zones utilised in this study adhere to the pre-2016 policy of Translink, featuring a roughly concentric circular system of 23 zones. The first zone encompasses the central business district of Brisbane, while the 23rd zone represents the farthest location from Brisbane, namely Noosa. Figure 1 illustrates the zoning system in the dataset (Wallner et al. 2018).

Fig. 1
figure 1

South East Queensland Translink’s zones

Generative Adversarial Network (GAN)

Generative Adversarial Networks (GANs) are deep learning generative models that aim to discover the patterns in input data and then generate new observations that are very similar to the original dataset. The core idea of GANs is the use of two sub-models: a generator model that generates new observations, and a discriminator model that classifies the generated observations as either real or fake. The two sub-models are trained jointly in a game-theoretic zero-sum game until the discriminator can no longer differentiate generated observations from real ones better than chance (i.e., it is correct only about half of the time), which means that the generator is capable of generating realistic observations. More details on GAN can be found in the original paper by Goodfellow et al. (2014).

GANs and their numerous extensions have been emerging in the recent literature for various generative purposes, such as generating images (Karras et al. 2020) or health data (Yale et al. 2020), due to their performance and versatility in modelling and generating synthetic data. We are particularly interested in the Conditional Tabular GAN (CTGAN), an extension of GAN that excels in modelling and generating mixed tabular data of continuous and discrete variables (Xu et al. 2019), similar to our Smart Card data. CTGAN has been shown to outperform many other generative methods in the original paper (Xu et al. 2019) and in several specific applications, such as generating synthetic insurance data (Kuo 2019) or disease data in agriculture (Ahmed et al. 2021).

Compared to the original GAN, CTGAN introduces two techniques that enhance the quality of data generation for tabular data such as our Smart Card data: mode-specific normalisation and conditional training-by-sampling. Mode-specific normalisation improves the modelling of multi-modal distributions in numeric variables, while conditional training-by-sampling helps with imbalanced level frequencies in categorical variables. Both are necessary for our data: the numeric variables (both tag-on and tag-off have multiple peaks corresponding to the morning and afternoon peak periods) and the categorical variables (variables such as Origin zone and Destination zone are imbalanced, with the majority of people travelling to and from Zone 1, the city centre, in our data).

Bayesian Networks (BN)

In addition to the Generative Adversarial Network approach, a statistical approach using Bayesian Network models was developed for comparison and to explore another avenue for synthetic data generation. A Bayesian Network (BN) (Sun and Erath 2015) is a directed acyclic graph (DAG) with a variable and a conditional probability distribution associated with each node: the distribution for each variable is conditioned on the variables upstream of it in the DAG. A joint probability distribution for the variables can then be fitted on the graph and sampled from, generating a synthetic sample. BN and GAN represent two very different approaches to synthetic data generation. The BN tackles the problem as a statistician might: an appropriate, well-understood model is chosen for the data, its parameters are estimated, and the model can then be studied or used to generate synthetic data.

One distinction of great importance in the current context is that the BN approach is much more transparent than the GAN approach. Shape learning gives an easily understood dependency structure, parameter fitting using maximum-likelihood estimation is a well-understood statistical method, and forward sampling is a robust sampling method. Taken together, this makes it easy to deduce why the model does what it does. The GAN method, by contrast, is highly opaque: it is difficult to understand how the network learns and makes its decisions. This matters in a context where it is necessary to be able to prove that the data are anonymous, do not correspond to any particular individual, and do not expose aggregate information that could be problematic. On the other hand, Bayesian approaches are much less easy to generalise: they tend to require significant understanding of the underlying dataset and its statistical qualities, they often require the data to be preprocessed to allow a good model to be fitted, and they do not handle hybrid continuous–discrete datasets very well, usually requiring continuous data to be discretised to achieve good performance. Finally, temporality can be a challenge for Bayesian Networks: while Dynamic and Continuous-Time Bayesian Network (Yan-Feng et al. 2015) formulations do exist to handle temporal data, they are fairly new techniques and require the temporal dependency to be structured in a fairly restrictive manner.

The modelling process for data generation using Bayesian approaches takes place in three broad stages. In the first stage (the shape learning problem), the shape of the dependency graph for the data is learned using one of a number of possible algorithms. The model’s parameters are then fitted using maximum-likelihood or Bayesian estimation, and finally, a synthetic data sample can be generated by forward sampling from the fitted model. As shape learning is the first stage of the modelling and generation process, it is discussed first here.

For simple cases of the shape learning problem (those with few variables in the dataset), an exhaustive search of all possible dependency graph structures, scored by a criterion such as the Bayesian Information Criterion, can work well. However, as the number of possible DAGs on n labelled vertices follows the sequence 1, 1, 3, 25, 543, 29281, 3781503, ... (for n = 0, 1, 2, ...), an exhaustive search becomes prohibitively expensive beyond about four or five variables in the dataset, meaning that more sophisticated search methods are needed.
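The growth of this sequence can be verified with Robinson's recurrence for counting labelled DAGs; the short illustrative sketch below (Python, not tied to any specific library) reproduces it:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def n_dags(n: int) -> int:
    """Number of labelled DAGs on n vertices (Robinson's recurrence, OEIS A003024)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * n_dags(n - k)
               for k in range(1, n + 1))

print([n_dags(n) for n in range(8)])
# [1, 1, 3, 25, 543, 29281, 3781503, 1138779265]
```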

The Hill Climbing Algorithm is a greedy algorithm that begins with the hypothesis of a disconnected network and then adds one arc at a time, on the basis of whether the addition increases the scoring criterion or not. This runs fast and uses resources efficiently. The major issue with the Hill Climbing Algorithm is that it terminates when it hits a local optimum, and as the shape learning problem is almost certainly not convex, the chances of the algorithm finding a globally optimal solution for anything except the simplest problems are small. Still, the algorithm has the advantage of being fast, especially by comparison with a constraint-based search, and it often gives remarkably good results.

An alternative to Hill Climbing is the Constraint-Based Search approach. This algorithm relies on identifying variables in the data that are independent of each other using the chi-squared test for conditional independence. Having identified the set of conditional independencies, the search is restricted to the subset of all possible graphs that satisfy these constraints. As the search space is considerably reduced by the constraints, it can be searched efficiently without consuming excessive computational resources. The Constraint-Based search creates graph structures that are in general more connected than those from the Hill Climbing approach and takes longer to run, but it is less likely to get stuck at a local optimum in the way that the Hill Climbing algorithm is. In this paper, shape learning on the dataset was performed using both the Hill Climbing and Constraint-Based search approaches.
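As an illustration of the two search strategies, a minimal structure-learning sketch is given below, assuming the open-source pgmpy library and illustrative column names matching the fields described earlier; the exact implementation used in this study may differ.

```python
import pandas as pd
from pgmpy.estimators import HillClimbSearch, BicScore, PC

# Discretised Smart Card records; file and column names are illustrative only
df = pd.read_csv("smartcard_discretised.csv")[
    ["Route", "Direction", "ZoneStart", "ZoneEnd", "Tag_on", "Tag_off"]
]

# Score-based search: greedy Hill Climbing scored by the Bayesian Information Criterion
hc_dag = HillClimbSearch(df).estimate(scoring_method=BicScore(df))
print("Hill Climbing edges:", hc_dag.edges())

# Constraint-based search: PC algorithm with chi-squared conditional independence tests
pc_dag = PC(df).estimate(ci_test="chi_square")
print("Constraint-based edges:", pc_dag.edges())
```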

Having fitted a DAG, parameters can then be fitted to the generated DAG. In the process of parameter learning, the goal is to estimate the values of the conditional probability distributions (CPDs). The most popular parameter estimation method is maximum likelihood (ML), which uses the relative frequencies of occurrence of each state of a variable. Although the ML estimator is simple to use, it suffers from overfitting when the observed data do not represent the underlying distribution. Even with a large sample of training data (such as in our case study with the Smart Card data), the conditional state counts for each parent configuration become heavily fragmented, particularly when a variable has multiple parents with numerous states. As a result, ML estimation is fragile and unstable for learning Bayesian Network parameters.

Bayesian parameter estimation mitigates the overfitting associated with ML estimation. The Bayesian parameter estimator employs prior CPDs to express pre-existing beliefs about the variables before observing the data. The estimator then updates these priors by incorporating state counts from the observed data. The priors can be considered as pseudo-state counts that are added to the actual counts before normalisation.

Finally, having trained a generative model, a synthetic sample can be created by forward sampling from the joint distribution.
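Continuing the structure-learning sketch above (again assuming pgmpy; class names vary slightly between pgmpy versions), parameter estimation and forward sampling could be expressed as follows.

```python
from pgmpy.models import BayesianNetwork
from pgmpy.estimators import BayesianEstimator
from pgmpy.sampling import BayesianModelSampling

# Build the network from the learned dependency structure (edges from shape learning)
bn = BayesianNetwork(hc_dag.edges())

# Bayesian parameter estimation with uniform pseudo-counts on every state (K2 prior)
bn.fit(df, estimator=BayesianEstimator, prior_type="K2")

# Forward sampling from the fitted joint distribution yields a synthetic sample
synthetic = BayesianModelSampling(bn).forward_sample(size=300_000)
```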

Evaluation of Synthetic Data

Here, we propose a three-step procedure to evaluate the quality of the synthetic data generated by BN and GAN, as follows.

The first evaluation step entails analysing the probabilistic spatial–temporal distributions of the synthetic data in comparison to the real Smart Card data. This involves examining distributions that relate directly to the variables of the synthetic and real data, as well as those derived from multiple variables (such as travel time, the difference between \(Tag\_on\) and \(Tag\_off\)). Additionally, statistical evaluations of these distributions are conducted using the two-sample Kolmogorov–Smirnov (KS) and Chi-squared (CS) tests. We also introduce a new spatial distribution distance index to compare the generated origin–destination pairs against the real data.

The existing statistical tests and distance indexes fall short in adequately comparing mixed data. To address this, the second evaluation step employs optimal transport theory, specifically the Wasserstein metric, which can measure the distance between probability distributions, even if the data are not on the same probability space. The Wasserstein metric computes the cost of moving one distribution to match another, with smaller distances indicating closer distributions.

Finally, to verify that the synthetic data comply with privacy regulations and do not expose real data to end-users, synthetic data samples generated by the BN and GAN methods are searched against the training data for duplicates in the final evaluation step, and this process is iterated using a bootstrapping procedure.

Numerical Experiments

We implement both GAN and BN on the Smart Card data from South East Queensland, Australia. All the developed codes, fitted models, and generated synthetic Smart Card data (excluding the original Smart Card data) are available at an Open Science Framework repository at osf.io/2xep5/.

Bayesian Network Implementations

We first convert the timestamps for \(Tag\_on\) and \(Tag\_off\) into minutes from midnight, making them continuous variables ranging from zero to 1440 min. Because BN learns categorical/discrete variables better, we convert the continuous variables (\(Tag\_on\) and \(Tag\_off\)) to categorical only in the dataset that we use specifically for BN, while keeping them continuous for the GAN. Experiments show that rounding to the nearest 5-min interval works best for BN (e.g., if \(Tag\_on\) equals 451.4 min, we round the value to 450 min).
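A minimal preprocessing sketch of this conversion and rounding, assuming pandas and illustrative column names, is:

```python
import pandas as pd

def minutes_from_midnight(ts: pd.Series) -> pd.Series:
    """Convert timestamps to continuous minutes after midnight (0-1440)."""
    ts = pd.to_datetime(ts)
    return ts.dt.hour * 60 + ts.dt.minute + ts.dt.second / 60

def round_to_5min(minutes: pd.Series) -> pd.Series:
    """Round to the nearest 5-min bin for the BN dataset, e.g., 451.4 -> 450."""
    return (minutes / 5).round() * 5

# df: raw Smart Card transactions loaded beforehand (illustrative)
df["Tag_on"] = minutes_from_midnight(df["Tag_on"])    # continuous, used by GAN
df["Tag_off"] = minutes_from_midnight(df["Tag_off"])

df_bn = df.copy()                                      # discretised copy, used by BN
df_bn["Tag_on"] = round_to_5min(df_bn["Tag_on"])
df_bn["Tag_off"] = round_to_5min(df_bn["Tag_off"])
```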

In BNs, a graphical model known as a directed acyclic graph (DAG) is used to depict probabilistic connections between variables. Each node in the DAG represents a random variable, and the edges between nodes reflect conditional dependencies between the variables; a directed edge from node A to node B shows that variable A directly influences variable B. Shape learning for the BN was performed with bootstrap aggregation (\(n'\) = 50,000). Constraint-Based learning and Hill Climbing algorithms for shape learning were both evaluated; while both learned similar dependency structures, the Constraint-Based algorithm tended to over-fit the dependency structure, leading to a less parsimonious model. The final dependency structure chosen for the Bayesian Network was thus the one learned by the Hill Climbing Algorithm, which is shown in Fig. 2. Each node in the graph corresponds to a random variable in the dataset, and each directed edge between nodes represents a conditional probability distribution.

Fig. 2
figure 2

Dependency structure learned by the Hill Climbing Algorithm

We can read the arrows in Fig. 2 as directed conditional probability distributions, which can be represented mathematically using Bayes’ rule. For instance, the arrow from \(Tag\_on\) to \(Tag\_off\) can be written mathematically as

$$\begin{aligned} {\mathbb {P}}(Tag\_off|Tag\_on) = \frac{{\mathbb {P}}(Tag\_on, Tag\_off)}{{\mathbb {P}}(Tag\_on)}, \end{aligned}$$
(1)

where \({\mathbb {P}}(Tag\_on, Tag\_off)\) is the joint probability distribution of \(Tag\_on\) and \(Tag\_off\), and \({\mathbb {P}}(Tag\_on)\) is the marginal probability distribution of \(Tag\_on\). Here, the value of \(Tag\_off\) solely depends on \(Tag\_on\) and does not depend on any other variables. We can see a very similar relationship between ZoneStart and ZoneEnd, but ZoneEnd is also dependent on the choice of Route.

The root node of the DAG is Route. The arrows from Route to ZoneStart, ZoneEnd, Direction, and \(Tag\_on\) show that all the downstream variables depend on the choice of Route. This makes intuitive sense, as a Route may only be available in specific zones, directions, and times.

Other relationships in the DAG include the fact that the ZoneStart node directly influences the ending zone (ZoneEnd) and the direction of trips, and this choice also affects the choice of start time (\(Tag\_on\)). The \(Tag\_on\) node is directly dependent only on Direction and Route, as no other variables should affect an individual’s decision on when to start their trip. Information about previous trips by the same commuter might have an influence, but as commuter identifiers are not included in the current database, no such information is available. The shape learning algorithm has thus fitted a coherent, sensible dependency structure that allows the decision processes of commuters to be understood.

The conditional probability distributions represented by the arrows in a BN can be estimated from data using maximum-likelihood or Bayesian methods. We chose Bayesian parameter estimation, as this method returned relatively similar parameters across our ten replication tests using samples of 50,000 random trips from the Smart Card dataset. Similar tests using Maximum-Likelihood estimation returned unstable parameters, and it was thus not chosen for the final analysis. Uniform priors were chosen for the Bayesian parameter estimation method, where all states are deemed equiprobable. A synthetic sample of 300,000 trips was then generated from the fitted model by forward sampling.

Finally, we implement a Chi-squared test to confirm that the variables of connected nodes in our BN are significantly correlated with each other. Table 1 shows that all the arcs between source and target nodes connect significantly correlated variables.

Table 1 Chi-squared independence test of BN’s variable nodes

Generative Adversarial Network Implementation

We implemented GAN (Xu et al. 2019) using the following architecture; the specific hyperparameters were largely found by trial-and-error experiments (see the sketch after this list).

  • 1000 epochs with a \(batch\_size\) of 100. The number of epochs determines the number of training iterations GAN performs to optimise its parameters; \(batch\_size\) determines the number of samples used in each step

  • \(embedding\_dim\) = 128. Size of the random sample passed to the Generator in GAN.

  • Dimension of the Generator \(generator\_dim\) = (256,256,256)

  • Dimension of the Discriminator \(discriminator\_dim\) = (256,256,256)

  • Learning rate of the Generator \(generator\_lr\)= \(2e-4\), and Discriminator \(discriminator\_lr\) = \(2e-4\).

  • Weight decay of the Generator’s Adam optimiser \(generator\_decay\) = \(1e-6\) and of the Discriminator’s Adam optimiser \(discriminator\_decay\) = \(1e-6\)

  • Number of Discriminator’s updates for each Generator update, \(discriminator\_steps\) = 1.
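These hyperparameters map one-to-one onto the constructor arguments of the open-source ctgan package implementing CTGAN (Xu et al. 2019); a minimal sketch of how they might be wired up, assuming that library and illustrative column names, is:

```python
from ctgan import CTGAN

# Categorical fields; Tag_on and Tag_off remain continuous
discrete_columns = ["Route", "Direction", "ZoneStart", "ZoneEnd", "Month", "Weekday"]

model = CTGAN(
    epochs=1000,
    batch_size=100,
    embedding_dim=128,
    generator_dim=(256, 256, 256),
    discriminator_dim=(256, 256, 256),
    generator_lr=2e-4,
    discriminator_lr=2e-4,
    generator_decay=1e-6,
    discriminator_decay=1e-6,
    discriminator_steps=1,
)
model.fit(df, discrete_columns)   # df: the processed Smart Card records
synthetic = model.sample(300_000)
```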

Evaluation Step 1: Probabilistic Evaluation of Synthetic Data

As detailed in Sect. “Methodology”, the initial stage of our evaluation procedure for the produced synthetic data involves a probabilistic analysis of the generated distributions in comparison to the real data. We posit that the synthetic data ought to exhibit equivalent probabilistic distributions to the authentic Smart Card data.

Distributions of \(Tag\_on\) and \(Tag\_off\)

The first distributions of interest are the distributions of \(Tag\_on\) and \(Tag\_off\) times, i.e., the boarding and alighting times of passengers on transit vehicles. We expect both of these distributions to be bi-modal, as more people use a transit service during the morning and afternoon peak periods. Figure 3 displays the generated distributions of the tag-on times and tag-off times (in minutes from midnight) for each of the models discussed above against the real data:

Fig. 3
figure 3

Distribution of tag on/off times for the real dataset and two models

As expected, the distributions of \(Tag\_on\) and \(Tag\_off\) are both bi-modal, with peaks during the morning and afternoon rush hours. The morning peak has a slightly higher density compared to the afternoon peak, and the afternoon peak generally splits into a smaller peak followed by the major peak. Note that \(Tag\_on\) and \(Tag\_off\) are both continuous variables (minutes from midnight) in the real data. While GAN learns this continuity directly from the data, we convert the variables to categorical (discrete 5-min intervals) to facilitate learning in BN. While both BN and GAN broadly fit a mixture of normal distributions similar to the underlying data, it is clear from the plots in Fig. 3 that the distribution of the real dataset is best approximated by the BN, which has almost identical properties. GAN fits a similar structure, although it appears to overestimate and misplace the peaks. GAN overemphasises peaks in the data, meaning that a dataset generated from the GAN would overpredict uncommon events. Overall, we may conclude that both algorithms can fit the data well. Although BN only learns a discrete version of the data, it shows a better fit, especially at the peaks and the tails of the distribution.

We then compare the real \(Tag\_on\) and \(Tag\_off\) data with the generated data using a statistical significance test. The statistics show whether the generated data are significantly different from the real data, and which generated data (between GAN and BN) are closer to the real data. We adopt the two-sample Kolmogorov–Smirnov (KS) test to measure the distance between the cumulative distribution function of \(Tag\_on\) and \(Tag\_off\) in the generated data and the equivalent column in the real data. The output for each variable is the difference between 1 and the KS test D statistic, so the higher the output, the closer the generated data are to the real data. The outputs from the KS test show that BN (KS test output of 0.987) performs better than GAN (KS test output of 0.887).
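A brief sketch of this per-variable KS score, assuming SciPy and illustrative dataframes real and synth, is:

```python
from scipy.stats import ks_2samp

def ks_similarity(real_col, synth_col) -> float:
    """1 minus the two-sample KS D statistic; higher means closer distributions."""
    return 1.0 - ks_2samp(real_col, synth_col).statistic

# real, synth: dataframes of real and generated records (assumed);
# averaging over the two continuous columns is one plausible aggregation
score = (ks_similarity(real["Tag_on"], synth["Tag_on"])
         + ks_similarity(real["Tag_off"], synth["Tag_off"])) / 2
```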

Generated Travel Time

We then look at the generated travel time versus the real data. Travel time is challenging for our data synthesis models to capture, as it is a by-product of the generated tag-on and tag-off timestamps of individual riders: travel time is estimated as the difference between tag-off and tag-on times. Figure 4a and b show the inbound and outbound mean travel times from BN, GAN, and the real data. The x-axis shows the tag-on time, while the y-axis shows the travel time of the associated trip. The mean travel time is calculated for each 15-min tag-on interval.

Fig. 4
figure 4

Mean travel time at different time of day

Figure 4 shows that BN can generate mean travel times very similar to the real data. Several traffic patterns are replicated by BN, e.g., the morning peak travel time in the inbound direction and the small afternoon peak travel time in the outbound direction. The real data contain some noise in the early morning and close to midnight, and thus, BN also captures some of this noisiness. On the other hand, the travel times generated by GAN stay around the 25-min mark, and thus, the overall trend of travel times has not been captured well by GAN. We can further confirm this by looking at the distribution of generated travel times versus the real data in Fig. 5. This figure shows that, overall, BN can replicate the travel time better than GAN, but both algorithms struggle to capture the peak travel time. Similar to the patterns in Fig. 3, GAN captures the overall distribution of travel time but provides a limited fit near the peak and tail of the distribution.

Fig. 5
figure 5

Distribution of travel time

Distributions of Origin and Destination Zones

These variables are both categorical, as the zones vary from 1 (Brisbane CBD) to 23 (rural South East Queensland). If the algorithms can retain the distributions of origin–destination (OD) pairs, they can reproduce the spatial distribution of trips. Figure 6 shows three Chord diagrams of public transport trips from the real data (Fig. 6a), generated data from GAN (Fig. 6b), and generated data from BN (Fig. 6c). The chords show the number of trips between pairs of public transport zones in South East Queensland, Australia; the larger the chord, the more trips there are in the data.

Fig. 6
figure 6

Distribution of origin–destination pairs of the three models

Both GAN and BN can replicate the overall spatial travel patterns, where the majority of the trips are between and within a few zones. Figure 6 shows that zones 1, 5, and 6 are popular zones, and there are far fewer trips starting or ending in zones 10–23. While both GAN and BN can replicate those patterns, BN generates the proportion of trips from each zone more accurately. In the data generated by GAN, the most popular zones (zones 1, 5, and 6) are slightly less popular than in the real dataset, whereas the remaining zones have a larger share than in the real data.

Similar to the previous section, where the KS test was adopted to evaluate the fit of the two continuous variables (\(Tag\_on\) and \(Tag\_off\)), here we adopt another statistical test to evaluate the categorical variables Direction, Route, ZoneStart, and ZoneEnd. We implement the Chi-squared (CS) test to compare the distributions of two discrete columns, one from the generated data and one from the real data (e.g., Direction, Route, ZoneStart, and ZoneEnd). The output for each variable is the p value from the CS test, which indicates how consistent the generated data are with the hypothesis that they follow the same distribution as the real data. Thus, the higher the output, the closer the generated data are to the real data. The output of the CS test shows that the generated categorical data from BN fit better than the data from GAN (with CS test p values of 0.958 for BN and 0.879 for GAN, respectively).
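One plausible reading of this per-variable CS score, assuming SciPy and the same illustrative dataframes, is a chi-squared test of homogeneity on the category frequency tables:

```python
import pandas as pd
from scipy.stats import chi2_contingency

def cs_pvalue(real_col: pd.Series, synth_col: pd.Series) -> float:
    """p value of a chi-squared homogeneity test between the two frequency tables."""
    table = pd.concat([real_col.value_counts(), synth_col.value_counts()],
                      axis=1, keys=["real", "synthetic"]).fillna(0)
    _, p_value, _, _ = chi2_contingency(table.T)
    return p_value

# real, synth: dataframes of real and generated records (assumed);
# averaging over the categorical columns is one plausible aggregation
cols = ["Direction", "Route", "ZoneStart", "ZoneEnd"]
score = sum(cs_pvalue(real[c], synth[c]) for c in cols) / len(cols)
```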

Although the Chi-squared test is widely used for analysing discrete variables, it may not be entirely appropriate for our dataset: the test relies on a large-sample approximation and typically assumes expected counts of more than five in each cell. Other distance metrics or statistical tests, such as the Jaccard or Sorensen–Dice index, are not well suited to our evaluation needs, because some variables, such as the origin and destination data, must be assessed jointly, and the frequency of data values is an essential factor. We therefore develop a new spatial distribution distance index to compare the generated OD pairs against the real data, as follows:

$$\begin{aligned} d_{OD} = \sum _{i=1}^K | \mu _i^s - \mu _i^r |, \end{aligned}$$
(2)

where \(\mu _i^s\) and \(\mu _i^r\) are the proportions of trips from the \(i_{th}\) OD pair in the synthetic and real data, respectively. \(\mu _i^s\) and \(\mu _i^r\) can be calculated by simply dividing the number of trips from the \(i_{th}\) OD pair by the total number of trips in the dataset:

$$\begin{aligned} \mu _i^s = \frac{\varSigma Trips_{i}^s}{\varSigma Trips^s} \end{aligned}$$
(3)
$$\begin{aligned} \mu _i^r = \frac{\varSigma Trips_{i}^r}{\varSigma Trips^r}. \end{aligned}$$
(4)

Unlike the popular statistical tests and distance indexes, the proposed spatial distribution distance index \(d_{OD}\) takes into account the patronage distribution of each individual OD pair and is free from any statistical assumptions. The spatial OD distribution generated from the synthetic data fits the real data better when \(d_{OD}\) is smaller. The results are in line with the visual comparison in Fig. 6, as the \(d_{OD}\) value for BN is 0.14, while the \(d_{OD}\) value for GAN is 0.45.
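A brief sketch of Eq. (2), assuming pandas dataframes with the illustrative columns ZoneStart and ZoneEnd, is:

```python
import pandas as pd

def d_od(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Spatial distribution distance index: L1 distance between OD-pair trip shares (Eq. 2)."""
    share_r = real.groupby(["ZoneStart", "ZoneEnd"]).size() / len(real)
    share_s = synth.groupby(["ZoneStart", "ZoneEnd"]).size() / len(synth)
    shares = pd.concat([share_s, share_r], axis=1, keys=["s", "r"]).fillna(0.0)
    return float((shares["s"] - shares["r"]).abs().sum())
```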

Spatial–Temporal Travel Patterns

We then evaluate whether GAN and BN can replicate both spatial and temporal patterns in the data when they are considered together. Here, we look at \(Origin\_zone\), the zone ID of the trip origin, and \(Tag\_on\), the starting timestamp of a public transport trip. Figure 7 shows the distribution of tag-on timestamps in the first six travel zones in the data. It illustrates that even with one more dimension being considered (spatial zones), both GAN and BN can still replicate the temporal distribution of tag-on timestamps. Both the morning and afternoon peaks are replicated in the synthetic data. Here, we can also observe that each zone has different traffic patterns that neither GAN nor BN is able to capture perfectly, e.g., the unusually high peak in the afternoon for Zone 3, or in the morning for Zone 4.

Fig. 7
figure 7

Distribution of tag-on zones at different time of day

Finally, we look at the distribution of travel time at each travel zone in Fig. 8. This is the most challenging variable for GAN and BN to capture, as we are interested in a temporal by-product (travel time) that is spatially constrained (travel zones). Figure 8 shows the real and generated distribution of travel time at the first six travel zones in the data.

Fig. 8
figure 8

Distribution of travel time at each zone

Figure 8 shows that each zone has a unique distribution of travel time. Zone 1 and Zone 2 have more trips with higher travel times than the rest of the zones, while the travel times of trips from Zones 4 to 6 are highly concentrated at lower values. Both GAN and BN struggle to learn the complex travel time distributions across zones, with BN performing slightly better than GAN. The generated travel time is relatively stable across the zones. We leave the spatial learning of by-product temporal variables (e.g., travel time) to a future study, as spatial interaction data synthesis models may need to be introduced for this purpose.

Evaluation Step 2: Similarity Index for Multi-variate Synthetic Data

While we have a mixture of continuous and discrete variables in our dataset, the KS and CS tests only compare individual variables, and each works well with only a single type of variable (the KS test for continuous variables and the CS test for discrete variables). To adequately gauge the relative performance of the data synthesis methods discussed in this paper, a similarity index was developed to compare the synthetic data generated by the methods against each other and against the training data. Comparing samples from distributions over arbitrary spaces is, in general, a challenging problem. While good statistical methods such as the two-sample t test or ANOVA exist for data that are low-dimensional and normally distributed, the data here are neither normally distributed nor low-dimensional. The fact that the data are hybrid continuous–discrete data further complicates matters, as defining a distance metric on the data points becomes a complex undertaking.

To this end, optimal transport theory offers a proven way to measure the distance between probability distributions. For a one-dimensional space, this can be achieved using the Wasserstein metric, also known as the “Earth Mover’s Distance”, which measures the distance between probability distributions. Unlike the KS or CS test, the Wasserstein metric does not require both measures to be on the same probability space. Intuitively, the Wasserstein metric is the transportation cost of shifting the probability mass of one distribution such that it is made identical to another distribution. The shorter the distance between the distributions, the closer they are to being identical. It is defined as follows:

$$\begin{aligned} W_p (\mu , \nu ):=\left( \inf _{\gamma \in \varGamma (\mu , \nu )} \int _{M \times M} d(x, y)^p \, \textrm{d} \gamma (x, y) \right) ^{1/p}. \end{aligned}$$

To estimate the distance between multivariate probability distributions, we formulate an optimal transportation problem for shifting mass from one distribution (or sample) to another and solve it to optimality; the distance is then equal to the value of the objective function at the optimum. With this in mind, we propose a Wasserstein-based distance metric between two arbitrary dataframes of multiple variables over a defined metric space. We solve this Linear Programming problem using a Linear Programming solver (the Python library PuLP with the COIN-OR solver). For simplicity, we use the normalised Euclidean distance to calculate the distance between data points.
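A minimal sketch of this transportation LP with PuLP is shown below for the 1-Wasserstein case, assuming two equally weighted samples that have already been numerically encoded and normalised (the exact encoding used in this study is not spelled out here).

```python
import numpy as np
import pulp

def wasserstein_lp(X: np.ndarray, Y: np.ndarray) -> float:
    """Empirical 1-Wasserstein distance between two equally weighted samples
    X (n x d) and Y (m x d), solved as a transportation LP with PuLP / COIN-OR CBC.
    Columns are assumed to be already normalised and one-hot encoded."""
    n, m = len(X), len(Y)
    cost = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=2)  # pairwise distances

    prob = pulp.LpProblem("optimal_transport", pulp.LpMinimize)
    g = pulp.LpVariable.dicts("gamma", (range(n), range(m)), lowBound=0)
    prob += pulp.lpSum(cost[i, j] * g[i][j] for i in range(n) for j in range(m))
    for i in range(n):  # each source point ships exactly 1/n of the mass
        prob += pulp.lpSum(g[i][j] for j in range(m)) == 1.0 / n
    for j in range(m):  # each target point receives exactly 1/m of the mass
        prob += pulp.lpSum(g[i][j] for i in range(n)) == 1.0 / m
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return float(pulp.value(prob.objective))
```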

Having developed an adequate way of calculating the Wasserstein metric between hybrid datasets, the next step was to estimate values for the distance between the different datasets. As calculating the values directly made comparisons difficult and was computationally prohibitive, this was performed using a bootstrapping approach: a sample of 200 data points was drawn from the training data, synthetic GAN data, and synthetic Bayesian Network data, and the distance between each pair of samples was calculated. By repeating this procedure a hundred times and taking the arithmetic means of the distances calculated between each pair of dataset samples, an estimate of the Wasserstein distances between our three datasets was calculated.

Interestingly, and in conflict with the results from the statistical methods detailed above, the Wasserstein distance implementation detailed here suggests better performance by the GAN than by the BN in modelling the qualities of the real data. The mean Wasserstein distance between samples from the real data and the synthetic dataset created by BN is 151.52, as opposed to a distance of only 129.11 between the real data and the synthetic GAN dataset. The distance calculated between the BN data and the GAN data, 177.61, is comparable to the distance between the synthetic BN data and the real data. This finding is of significant interest, as it suggests that, in some sense, the GAN captures the coupled structure of the data better than the Bayesian approach does.

A potential explanation for this finding may lie in the way each method captures the tag-on and tag-off times in the data: in particular, the BN method can only generate timestamp values that exist in the (discretised) original dataset, whereas the GAN is capable of generating novel timestamps. If the GAN consistently generates timestamps that are close but not identical to real timestamps, whereas the Bayesian Network is forced to choose an existing timestamp to represent the data, the seemingly contradictory observation mentioned above could well be explained.

Evaluation Step 3: Privacy Validation of Synthetic Data

To demonstrate that the data generated by both methods do not violate privacy regulations by exposing real data to the end-user, samples of synthetic data were generated by the BN and GAN methods. These samples were then searched against the training data for duplicates, and this was repeated in a bootstrapping procedure.

It was found that, in samples of 10,000 synthetically generated trips from both the BN approach and the GAN approach, no real trips could be identified. Repeating this procedure ten times gave the same result, suggesting that if real trips are ever reproduced, they are exceedingly uncommon. The rate of real trips in the synthetic data thus does not pose a threat to privacy.
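A brief sketch of this duplicate check, assuming pandas dataframes and hypothetical sampling helpers for each fitted model, is:

```python
import pandas as pd

def count_exact_matches(real: pd.DataFrame, synth: pd.DataFrame) -> int:
    """Number of synthetic records identical to a real record on every field."""
    cols = list(real.columns)
    return len(synth.merge(real[cols].drop_duplicates(), on=cols, how="inner"))

# Bootstrapped check: 10,000 synthetic trips per method, repeated ten times.
# train_df, bn_sample, and gan_model.sample are hypothetical handles to the
# training data and the fitted models:
# for _ in range(10):
#     print(count_exact_matches(train_df, bn_sample(10_000)),
#           count_exact_matches(train_df, gan_model.sample(10_000)))
```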

Discussion and Conclusion

Privacy concerns have led to the use of data anonymisation techniques to protect personal information in human activity data analysis. However, recent research has revealed the limitations of classical anonymisation techniques for big transport data, prompting the use of AI and machine learning to generate synthetic data that retain distributions and correlations without any personal records. These approaches offer a more effective solution for preserving confidentiality and data utility in research.

This paper aims to generate a synthetic version of a large transport dataset, the Smart Card data, that maintains the spatio-temporal mobility patterns of individuals while ensuring the anonymity of each data point. This objective diverges from prior research, which has primarily focused on synthetic data generation for images and texts and has yet to adequately address transport data. The paper compares two advanced methods for the synthetic generation of human activity data and proposes the use of the Wasserstein distance to evaluate the similarity of the generated synthetic data to the real data. Both methods can model the data well and generate synthetic data that are similar to the real data. BN’s synthetic data are slightly better where individual variables are concerned, and even when spatio-temporal data are compared, as BN can learn the peaks in the data better. However, our proposed Wasserstein-based multivariate metric shows that when all variables are considered jointly, GAN produces slightly better synthetic data.

One of the primary limitations of our approach is the coarse spatial resolution. The granularity of our synthetic data may not be sufficient for certain types of analyses, particularly those requiring geographical coordinates. Future work may include finer-scale synthesis of spatial data; e.g., geographically weighted neural networks (Hagenauer and Helbich 2022) could be integrated into a spatially aware GAN.

Our synthetic data generation methods are also limited by the absence of travel purpose data in Smart Card data. This restricts the usability of our synthetic data for certain research inquiries, particularly those concerning travel behaviour and activity modelling. Nonetheless, it is important to emphasise that our approach does not aim to substitute datasets that include comprehensive travel purpose data, but rather to provide a privacy-preserving alternative for researchers who cannot access or use sensitive personal data such as Smart Card data. In future studies, we could explore inferring and validating travel purpose using Smart Card data and re-evaluating the generated synthetic data to assess whether they preserve travel purposes. This may require monitoring and modelling the travels of a single traveller over multiple days using unique identifiers in the dataset.