Introduction

In biomedical research, scientific progress is often limited by the availability and quality of data. The results of any study can only be as good as the data on which the statistical analysis was based. Additionally, machine learning methods - including deep learning - are completely dependent on the quality and amount of the available training data. For some application fields it is extremely hard to obtain enough data to be able to draw any conclusions, for example in the case of rare diseases. In particular, to realise the full potential of deep learning methods, the amount of data should be larger than for more traditional machine learning approaches1. To increase availability of medical data, a legally compliant mechanism for sharing it between different institutions is needed. Furthermore, being able to share the data used in a publication is an important step in making its results reproducible in terms of the FAIR principles2. However, the storage, processing, and sharing of individual-level health data is tightly regulated and restricted by law, as health-related data are generally considered to be highly sensitive (e.g.,3 Art. 9).

Sharing personal health data in compliance with laws and regulations, such as the European Union (EU) General Data Protection Regulation (GDPR), usually requires informed consent. However, this is often not feasible, for example if data is to be analysed retrospectively at a large scale. As an alternative, data can be anonymised in such a way that it can no longer be traced back to specific individuals. This typically requires significant modifications, e.g., removing direct identifiers (such as names) and coarsening indirect identifiers (such as age or geographic region). Moreover, this approach inevitably requires balancing the reduction of risks achieved through the removal of information against the associated reduction in the utility of the data4. This also means that complete anonymisation might not always be achievable for many types of data, with genetic information being a common example.

An alternative way to share data is to use synthetic data generation methods, which have proven effective in the past5,6,7,8,9. In this approach, instead of modifying the original data to make them harder to re-identify, a completely new dataset is created, ideally with statistical properties similar to those of the real data.

In this study, we apply and adapt state-of-the-art algorithms to generate synthetic data for a defined use case, for which several downstream analyses have already been performed on the real data and their results have been published. In addition to analysing summary statistics of the dataset, this gives us the opportunity to gain deep insights into the possibilities and limitations of synthetic data in the specific use case.

The original data used here are part of the data collected within the Dortmund Nutritional and Anthropometric Longitudinally Designed (DONALD) study, an ongoing nutritional cohort study capturing information about the diet and health of children in Dortmund, Germany, since 198510. Participants are recruited as newborns and then accompanied until young adulthood to obtain a holistic picture of their developing health and the dietary factors that influence it. The subset used in this study includes dietary data with a focus on sugar intake, based on all dietary records of participants aged three through 18 years between 1985 and 2016. The resulting dataset consists of structured health data, where the same 33 variables have been recorded over fixed yearly intervals.

The data collected by the DONALD study have already been used for several nutritional studies, such as the recent analysis of sugar intake time trends by Perrar et al.11,12, which was performed on the same subset used for this work. This subset, which will from now on be referred to as the DONALD data, has the following properties that need to be taken into account when searching for an appropriate synthetic data generation method: It is longitudinal, because its data have been collected over a series of 16 visits. Additionally, it contains static variables that are only collected during the first visit. It is also heterogeneous, since its columns consist of various different data types. Finally, it is incomplete in terms of longitudinality, as not all participants attended each annual visit.

In the literature, a variety of machine learning-based methods have been proposed to generate synthetic data; we discuss three classes that are commonly used (or combined): probabilistic models, variational autoencoders, and generative adversarial networks (GANs). GANs in particular, originally developed by Goodfellow et al.13, are currently used in a variety of generative tasks and have proven successful in generating realistic fake images and natural text data (e.g.,14,15,16,17). However, GANs were originally proposed for the generation of continuous data13. Choi et al.18 address this by combining a GAN with an autoencoder that has been pre-trained to compress and reconstruct the original data. The resulting model, MedGAN, provides an attractive solution for generating convincing electronic health records (EHRs) with continuous, binary, and count features. Unfortunately, it cannot be used for the generation of DONALD data, since it is not able to generate longitudinal data.

An alternative for longitudinal data has been developed by Esteban et al.19, who propose two models (RGAN and RCGAN), both based on recurrent neural networks, to produce real-valued time series data. Even though the model is able to learn longitudinality, it can cope neither with heterogeneous data types nor with the mixture of static and longitudinal variables that is typical of clinical studies. A further alternative is timeGAN, another GAN-based method that learns an embedding to represent longitudinal data in a lower-dimensional space20. However, timeGAN is unable to generate static covariates and has not yet been applied to the medical domain; hence, data incompleteness would potentially cause problems that require further work.

Another promising method is the Variational Autoencoder Modular Bayesian Network (VAMBN), which has been designed specifically to generate fully synthetic data for mixed static and longitudinal datasets containing heterogeneous features with missing values21. To achieve this, a Heterogeneous-Incomplete Variational Autoencoder (HI-VAE) is combined with a Bayesian Network (BN). The data are split into modules, a HI-VAE is trained for each module to produce encodings, and the BN is fitted over the encodings of all modules. Longitudinality is handled by assigning the data of different visits to different modules. Due to these capabilities, VAMBN serves as the baseline algorithm for the present study.

Because of the generative nature of the task – in contrast to discriminative tasks – evaluating the quality and usefulness of the data is not trivial, especially when it is performed on complex, heterogeneous health data. Different studies use different measures, probably because the type and importance of individual features vary widely and depend strongly on the use case. Additionally, there is no standard terminology used by related studies. In this paper, we follow the terminology of Georges-Filteau and Cirillo, who broadly classify the metrics into quantitative and qualitative metrics22. Qualitative analysis is based on visual inspection of the results by field experts. For example, in a preference judgement, given a pair of two data points – one real and one synthetic – the aim is to choose the more realistic one18. A similar method in this category is the discrimination task, where the expert is shown one data point at a time and needs to decide whether it is realistic or not. However, according to Borji, qualitative methods based on visual inspection are weak indicators, and quantitative measures offer more convincing evidence of data quality at the dataset or sample level23. Georges-Filteau and Cirillo22 further distinguish three subcategories of quantitative measures: comparisons between real and synthetic data at the dataset level, comparisons of individual feature distributions, and utility metrics, which indicate whether the synthetic data can be used for real-world analyses that were planned or already performed on the real dataset.

Assessing the privacy of the data may be an even more difficult task, and it is also handled differently in related works. While some studies assess or control the risk of re-identification using empirical analyses (e.g.,18,24), another established method known as differential privacy25 can provide theoretical probabilistic bounds for various types of privacy risks. This method has been adapted to the area of deep learning by adding a specified amount of noise during training26. Differential privacy can be integrated into all algorithms presented in this study. While acknowledging the importance of a more careful analysis of the privacy risks associated with synthetic data, we would like to point out that this is in practice a rather complex task, which we consider out of the scope of this paper.

In the present study, we make the following contributions: We apply the state-of-the-art algorithm VAMBN to generate synthetic data based on the DONALD dataset – a dataset from a nutritional cohort study. We extend VAMBN with a long short-term memory (LSTM) layer27 to more effectively encode the longitudinal parts of the data and show a significant increase in the ability to reproduce direct dependencies across time points. We evaluate the generated synthetic data on four different levels and show that, while descriptive summary statistics and individual variable distributions can be reproduced efficiently with all chosen methods, direct dependencies can only be reproduced by our proposed extension. Finally, we conduct real-world analyses together with domain experts and gain valuable insights into how to exploit the potential of fully synthetic datasets.

Methods

Data

DONALD data

The methodology underlying the DONALD dataset used here has been described in detail by Perrar et al.11,12. The main content of each record is the nutrient intake, e.g., fat or carbohydrate intake, as well as the intake of different types of sugars, measured as a percentage of the total daily energy intake (%E), which itself is recorded in kilocalories per day.

Across all available records from the 1312 participants, the dataset spans ages three through 18 and contains 36 variables. An overview of the variables can be found in the Supplementary Material in Table S1. The only non-longitudinal variables are a personal number identifying each individual (pers_ID), an identification number for each family (fam_ID), and the sex of the participant. A complete participant record therefore contains 530 variables (2 static covariates + [16 visits \(\times\) 33 longitudinal variables]), with the personal number serving as the row identifier. Note that some participants have not been part of the study for all 16 visits, for example because they are still under 18, and that some of the yearly visits may also have been skipped for unknown reasons. Figure 1 explores the amount of missingness per participant and visit. Apart from missed visits, there is almost no missingness in the data. The only exceptions are two variables that describe the overweight status and the education level of the mother of the participant (m_ovw and m_schulab), which have small amounts of missingness (1.25% and 0.17%, respectively). For missing values in the original analyses11,12, the respective median of the total sample was used (n = 38 for maternal overweight status, n = 5 for maternal educational status).

For the application of VAMBN, the data need to be grouped into different modules. This was initially done by experts and resulted in four modules for the DONALD data: Times (T), Nutrition (N), Anthropometric (A), and Socioeconomic (S). Over the course of the study, varying module settings have been tested; an overview can be found in the Supplementary Material in Table S2. Moreover, for each setting, there are a few static variables – also called covariates – that are not grouped (i.e., the family number and the sex of the participant).

Figure 1

Missingness in the DONALD dataset. (a) shows the total number of visits attended per participant. (b) depicts, in blue, the number of participants who attended a specific visit and, in red, the number of participants who entered the study at that visit.

The following applies to the original data: The DONALD study was approved by the Ethics Committee of the University of Bonn (project identification of the most recent version: 185/20) according to the guidelines of the Declaration of Helsinki. All examinations are performed with parental written informed consent and, from adolescence (16 years) onwards, with the participants’ own written informed consent.

Pre-processing The data are stored in a tabular format with one row per visit per participant. The personal number serves as an identifier so that each row can be assigned to a specific participant. As can be seen in Fig. 1, the participants do not necessarily attend all 16 visits. To be able to apply the synthetic data generation methods to the DONALD data, the single rows of the dataset need to be mapped to the different visits based on the age of the participant. We thereby define 16 visits from zero to 15, where visit zero takes place at the age of three and visit 15 at the age of 18. This results in a tabular format in which each row corresponds to one single participant and includes all visits, so there are 530 columns. In this format, missed visits appear as missing values in the corresponding row.
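As an illustration, the following Python sketch shows one way such a long-to-wide mapping could be implemented with pandas. The rounding of the exact age to a visit index and the exact column names are assumptions for this example rather than the study's actual code; the VISxx suffix follows the naming used later in the text (e.g., zuzu_p_VIS07).

import pandas as pd

def to_wide_format(df):
    # df: one row per participant per visit, with columns 'pers_ID', 'fam_ID',
    # 'sex', 'alter' (exact age in years) and the longitudinal variables.
    df = df.copy()
    # Visit 0 corresponds to age 3, visit 15 to age 18 (rounding is a simplification)
    df["visit"] = df["alter"].round().astype(int) - 3
    static = df.groupby("pers_ID")[["fam_ID", "sex"]].first()
    longitudinal = df.drop(columns=["fam_ID", "sex"]).pivot(index="pers_ID", columns="visit")
    # Flatten the (variable, visit) column index to names such as 'zuzu_p_VIS07'
    longitudinal.columns = [f"{var}_VIS{visit:02d}" for var, visit in longitudinal.columns]
    return static.join(longitudinal)  # missed visits appear as missing values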

Post-processing To perform the subsequent expert analysis, we first need to convert the data back to the original format. This is done by mapping the data back to 16 rows per participant, i.e., one row per visit. Note that the synthetic data do not contain any random missingness or missed visits.

As we analyse the effect of sample size, we apply the following further post-processing steps to ensure that the synthetic datasets are as close as possible to the original data:

  • The synthetic dataset must contain the same number of participants.

  • The number of visits per participant needs to correspond to the original data.

  • The fraction of sexes in the synthetic and real cohorts should be similar.

For the evaluation, we distinguish between raw and post-processed output to fully evaluate the algorithms (see Section "Evaluation Metrics" for details). Note that for both output formats the first step – i.e., mapping the data back to 16 rows per participant – has been applied.
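A minimal sketch of such a post-processing step is given below, assuming both datasets share the same wide-format columns. The stratified draw by sex and the donor-based insertion of visit-missingness are plausible implementations of the listed requirements, not necessarily the exact procedure used in this study; the column name sex is likewise an assumption.

import pandas as pd

def match_real_cohort(synthetic_wide, real_wide, sex_col="sex", seed=0):
    n_real = len(real_wide)
    # 1) Stratified draw so that cohort size and sex fractions match the real data
    fractions = real_wide[sex_col].value_counts(normalize=True)
    parts = [
        synthetic_wide[synthetic_wide[sex_col] == sex].sample(
            n=int(round(frac * n_real)), random_state=seed)
        for sex, frac in fractions.items()
    ]
    out = pd.concat(parts)
    # 2) Each synthetic participant inherits the visit-missingness pattern
    #    (i.e., the missed visits) of a randomly drawn real participant
    donors = real_wide.sample(n=len(out), replace=True, random_state=seed)
    return out.mask(donors.isna().to_numpy())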

ADNI data

For a proof of concept of our methodology, we further assess the relevance of our findings on a second cohort study, obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (adni.loni.usc.edu). From the ADNI database, the ADNIMERGE cohort has been used. The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD). ADNI data have been used in previous studies for many applications, including synthetic health data generation7,28,29. The results can be found in the Supplementary Material.

Model

As baseline model, we use the Variational Autoencoder Modular Bayesian Network (VAMBN)21, which has been designed to generate fully synthetic data for longitudinal datasets containing heterogeneous features and missing values. To achieve this, a Heterogeneous-Incomplete Variational Autoencoder (HI-VAE)30, which is able to handle incomplete and heterogeneous data, is combined with a conditional Gaussian Bayesian Network (BN)31. To apply VAMBN, the dataset is first split into modules, which are encoded by individually trained HI-VAE models. Thereby, different variables of the dataset are grouped together based on context, preferably together with domain experts. For each module at each time point (i.e., visit), an HI-VAE is trained, learning a low-dimensional Gaussian mixture model of the input data. The resulting embeddings (consisting of discrete as well as Gaussian variables) are then used as input for a Modular Bayesian Network (MBN), which learns dependencies between these modules. At this point, auxiliary variables are introduced as missingness indicators for entire visits. For structure learning, so-called black and white lists are employed that prevent or enforce certain edges, respectively, in order to constrain the space of admissible graph structures as much as possible. From this MBN, synthetic embeddings can be drawn and subsequently decoded by their respective HI-VAE modules to produce the final synthetic data. We refer to21 for a more detailed and mathematically precise explanation.

Motivated by the longitudinal nature of our dataset, our extension has primarily been developed to improve VAMBN’s ability to reconstruct direct mathematical dependencies between variables from different time points.

General idea A simple approach to enable VAMBN to learn correlations is placing the correlated variables in the same module, because then the HI-VAE can account for their dependencies on its own. But this is limited to correlations at one specific visit, as the same variable observed at different time points is split into separate HI-VAE modules. The resulting embeddings are subsequently modelled via the BN. However, depending on the quality of the HI-VAE embeddings, this separation may result in a weakening of the temporal correlation structure after data synthesis.

This is why, in our proposed extension, all visits of each variable group are instead encoded simultaneously by one HI-VAE, which is extended by an LSTM encoder. As a result, longitudinal variables share one embedding, and the BN is only used to learn dependencies between the embeddings of the different variable groups, the standalone variables, and the missingness indicators. Here, standalone means that these variables are not grouped with any other variables, are not embedded by a HI-VAE, and are presented to the BN as-is.

Architecture To be able to encode all V time points (also called visits) of all D variables from a module with a single HI-VAE, some changes to the pipeline have to be made. If we have a data set with N entries and D variables, then:

$$\begin{aligned} \textbf{x}_n = [\textbf{x}_{n1}, ..., \textbf{x}_{nV}] \end{aligned}$$
(1)

and each element of the vector is D-dimensional.

$$\begin{aligned} \textbf{x}_{nv} = [x_{nv1}, ..., x_{nvD}] \end{aligned}$$
(2)

Therefore, the recognition model, originally proposed by Nazabal et al.30, now receives the observed variables of all visits as input, with \(\textbf{x}_n^o = [\textbf{x}_{n1}^o, ..., \textbf{x}_{nV}^o]\). As mentioned, a special type of recurrent neural network is used here – an LSTM – due to its ability to prevent the vanishing gradient problem even when trained on many time steps. The speciality of the LSTM is that its repeating unit contains four different layers instead of just one, and it follows the core idea of a cell state c running through the network. These layers – also called gates – decide which previous information is passed on to the next unit and which is forgotten. The number of repeating units determines the output dimensionality (\(h_{end}\)) and is configurable. The output of the LSTM serves as an intermediate representation, which is the input for the deep neural networks (DNNs) that – as in the original HI-VAE – parameterise the distributions from which \(\textbf{s}_n\) and \(\textbf{z}_n\) are drawn. Here, \(\textbf{s}_n\) is a one-hot encoded vector representing the component of the Gaussian mixture model, i.e., the Gaussian from which the encoding \({\textbf {z}}_n\) in latent space is drawn.
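A minimal PyTorch sketch of such an LSTM-based recognition stage is shown below. It illustrates the described idea rather than the authors' implementation (which builds on the original HI-VAE code); the layer sizes, the Gumbel-softmax relaxation for \(\textbf{s}_n\), and the Gaussian reparameterisation for \(\textbf{z}_n\) follow the standard HI-VAE recipe.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LSTMRecognition(nn.Module):
    # Encodes all V visits of a module's D variables into one embedding (s_n, z_n).
    def __init__(self, n_vars, hidden_dim, n_components, latent_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_vars, hidden_size=hidden_dim, batch_first=True)
        self.s_head = nn.Linear(hidden_dim, n_components)                    # logits of s_n
        self.z_head = nn.Linear(hidden_dim + n_components, 2 * latent_dim)   # mean / log-variance of z_n

    def forward(self, x):                          # x: (batch, V visits, D variables)
        _, (h_end, _) = self.lstm(x)               # final hidden state summarises the visit sequence
        h_end = h_end.squeeze(0)                   # (batch, hidden_dim)
        s = F.gumbel_softmax(self.s_head(h_end), hard=True)      # one-hot mixture component s_n
        mu, logvar = self.z_head(torch.cat([h_end, s], dim=-1)).chunk(2, dim=-1)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterised sample of z_n
        return s, z, mu, logvar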

This leads to further changes that need to be made to the generative model (i.e., the decoder): The intermediate representation vector \(\textbf{Y}\) is now also V times larger, to account for the increased number of data points encoded by the embedding \(\textbf{z}\). Here, \(\textbf{Y}\) is a single homogeneous intermediate representation vector, produced by a DNN \(\textbf{g}(\textbf{z})\), as in the original implementation. Note that \(\textbf{g}\) learns any dependencies between variables across different time points before \(\textbf{Y}\) is consumed by the DNNs that each parameterise one of the \(V\times D\) distributions of attributes at specific visits. Separate parameterisations for different visits are required, since the same variable may be distributed very differently at different time points. Formally, the decoder factorises as:

$$\begin{aligned} p(\textbf{x}_n, \textbf{z}_n, \textbf{s}_n) = p(\textbf{s}_n)p(\textbf{z}_n|\textbf{s}_n)\prod _v\prod _dp(x_{nvd}|\textbf{z}_n,\textbf{s}_n) \end{aligned}$$
(3)

Hence, it is factorised into three parts: the probability of \(\textbf{s}_n\), the conditional probability of \(\textbf{z}_n\) given \(\textbf{s}_n\), and the likelihoods of producing the output values \(\textbf{x}_{nvd}\) for all variables across the V visits and D attributes, given \(\textbf{z}_n\) and \(\textbf{s}_n\).

Hence, the Evidence Lower Bound (ELBO) – the optimisation criterion used during training of the autoencoder – is now calculated as a sum over the contributions of all \(N\times V\times D\) values, accounting for their heterogeneous data types and varying missingness over time.
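For reference, a sketch of the resulting objective, following the HI-VAE formulation of Nazabal et al.30, is given below; only observed values contribute to the reconstruction term, and the exact form used in the implementation may differ in details such as normalisation.

$$\begin{aligned} \mathcal {L} = \sum _{n=1}^{N}\Bigg (\mathbb {E}_{q(\textbf{s}_n,\textbf{z}_n|\textbf{x}_n^o)}\bigg [\sum _{v=1}^{V}\sum _{d=1}^{D}\mathbb {1}\,[x_{nvd}\ \text {observed}]\,\log p(x_{nvd}|\textbf{z}_n,\textbf{s}_n)\bigg ] - \text {KL}\big (q(\textbf{s}_n,\textbf{z}_n|\textbf{x}_n^o)\,\Vert \,p(\textbf{s}_n)\,p(\textbf{z}_n|\textbf{s}_n)\big )\Bigg ) \end{aligned}$$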

The Bayesian network still works the same way as in the original VAMBN approach, although it is much smaller due to the merging of all visits into one module (i.e., one node of the BN). Hence, the nodes of the BN originate from M modules instead of \(V\times M\) modules (plus standalone variables and missingness indicators in both cases).

To analyse the effect of the two described changes in the architecture, we compare the following VAMBN variants:

  1. Original VAMBN implementation.

  2. VAMBN – Memorised Time Points (VAMBN-MT): as described above, all V visits of a module’s D variables are encoded in one HI-VAE model, its ELBO is calculated as a sum over all \(N\times V\times D\) contributions, and the LSTM is inserted before the encoder.

  3. VAMBN – Flattened Time Points (VAMBN-FT): to judge the added value of the LSTM, we replace it with a standard feedforward network.

An overview of the applied architectures can be seen in Fig. 2.

Figure 2

Overview of the developed architecture. For all settings, the real-world data are pre-processed and embeddings are then learned by a Heterogeneous-Incomplete Variational Autoencoder (HI-VAE). We compare three different model architectures: the original HI-VAE used in the baseline approach, the Variational Autoencoder Modular Bayesian Network (VAMBN); VAMBN – Flattened Time Points (FT), where the structure of the feedforward network is changed such that all visits per module are encoded together, i.e., learned in one model; and VAMBN – Memorised Time Points (MT), where this feedforward network is replaced by an LSTM layer in order to better cope with longitudinal dependencies. The changes in FT and MT lead to a reduction of complexity in the Bayesian Network.

Evaluation metrics

We analyse the synthetic data on different levels in order to judge its quality. The methods can be divided into the following four categories; of these, the latter two are dataset-specific, since our focus lies on the utility of the generated data.

We analyse the individual variable distributions to ensure that they are correctly reproduced across the entire synthetic population. To this end, we provide summary statistics and density plots. Moreover, we quantify the differences between real and synthetic data distributions using the Jensen-Shannon (JS) divergence, which measures the relative distance between two probability vectors32. The output ranges from 0 to 1, with 0 indicating equal distributions. Data are binned in order to obtain probability vectors. We use numpy’s method histogram_bin_edges to determine the optimal bin size per variable by choosing the maximum of Sturges’ rule33 and the Freedman-Diaconis estimator34.
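As an illustration, the per-variable JS divergence can be computed as sketched below. Squaring scipy's jensenshannon (which returns the JS distance) with base 2 yields a divergence in [0, 1], matching the range given above; numpy's "auto" binning rule corresponds to the maximum of Sturges' rule and the Freedman-Diaconis estimator.

import numpy as np
from scipy.spatial.distance import jensenshannon

def js_divergence(real, synthetic):
    # Bin edges determined on the real data: maximum of Sturges and Freedman-Diaconis
    edges = np.histogram_bin_edges(real, bins="auto")
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synthetic, bins=edges)
    p = p / p.sum()                            # normalise counts to probability vectors
    q = q / q.sum()
    return jensenshannon(p, q, base=2) ** 2    # JS divergence in [0, 1]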

However, since correct distributions alone are not a sufficient indicator of convincing data, we furthermore evaluate how well correlations between variables are reproduced by the generative model. To this end, the Pearson correlation coefficients for all pairs of variables are calculated, including correlations across time points, and visualised in a heatmap. For quantification, we determine the relative error \(\epsilon\) of the correlation matrices as shown in Eq. 4: the Frobenius norm of the difference between the real and virtual correlation matrices is divided by the Frobenius norm of the real data correlation matrix. The value can therefore range between 0 and infinity, with 0 indicating a perfect reproduction of the correlations.

$$\begin{aligned} \varepsilon = \frac{||real-virtual||}{||real||} \end{aligned}$$
(4)
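A minimal sketch of this metric, assuming both datasets are available in the wide format with identical columns:

import numpy as np

def correlation_error(real_df, synth_df):
    # Pearson correlations over all variable/visit pairs (pandas DataFrames)
    real_corr = real_df.corr(method="pearson").to_numpy()
    synth_corr = synth_df.corr(method="pearson").to_numpy()
    # Relative error under the Frobenius norm (Eq. 4)
    return np.linalg.norm(real_corr - synth_corr) / np.linalg.norm(real_corr)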

Even if the variable correlations are reproduced well, this still does not guarantee that the result is convincing. Variables in the original data may not only correlate, but even have direct mathematical relationships that need to be met for the synthetic data to be realistic and plausible. Which direct dependencies exist is specific to the given dataset, so this work describes them for the DONALD dataset and analyses to what degree they are met by the synthetic data.

Finally, the direct practical utility of the synthetic data can be tested by running real-world analyses on both the original and the synthetic datasets and comparing their results. For the DONALD data, this means conducting the same time and age trend analyses of added sugar intake following the methods of Perrar et al.11,12, to investigate whether we can reproduce their results. In these analyses, polynomial mixed-effects regression models are fitted; we use their unadjusted models for time and age trends. Note that we re-implemented the polynomial mixed-effects models, originally coded in SAS, in R, so the values for the original data can deviate slightly from those reported by Perrar et al.12. Additionally, since our aim was not to investigate intake trends but to enable a direct comparison between original and synthetic data, we have simplified the presentation of the trend analyses, i.e., age and time trends are presented separately.
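The analyses in this study were run with the R re-implementation mentioned above; purely as an illustration of the type of model involved, an unadjusted cubic age-trend model with a random intercept per participant could be set up in Python with statsmodels roughly as follows (the column names zuzu_p, alter and pers_ID are taken from the variable names used in this paper; the original models may use a different random-effects structure).

import statsmodels.formula.api as smf

def fit_age_trend(df):
    # df: one row per participant per visit, with added sugar intake (zuzu_p, %E),
    # age (alter, in years) and the participant identifier (pers_ID)
    model = smf.mixedlm("zuzu_p ~ alter + I(alter**2) + I(alter**3)",
                        data=df, groups=df["pers_ID"])
    return model.fit()

# The same call is applied to the real data and to each synthetic dataset,
# and the resulting coefficients and p-values are compared.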

Experimental setup

An overview of our experimental setup can be seen in Table 1. For the evaluation of individual variable distributions, correlations between variables, and direct dependencies, we compare our two developed methods (VAMBN-FT and VAMBN-MT) with the baseline approach (VAMBN). Because we judge their effectiveness relative to the real data, we choose the same sample size (\(N=1312\)). For the subsequent real-world analyses, we choose VAMBN-MT for all experiments, but additionally investigate the influence of varying module selections, dependent on the research question. The module selections can be found in the Supplementary Material in Table S2. For each selection, we sample 1312 and 10,000 participants, respectively. Since we want the smaller dataset to resemble the original data as closely as possible, we apply the previously described post-processing to it (such as inserting the same percentage of missingness). In addition, we investigate the difference when using a much larger dataset without any post-processing. For each experiment and sample size, we sample 100 synthetic datasets (from the same model). To prevent outliers from influencing the time trend, we omit time points that are larger than the maximum value observed in the original data (i.e., lie in the future). To represent all 100 trend functions, we plot, for each time point, their mean along with the 2.5% and 97.5% quantiles.
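Summarising the 100 sampled trend curves per time point amounts to a simple mean and quantile computation, sketched below (the array layout is an assumption for this example).

import numpy as np

def summarise_trends(trends):
    # trends: array of shape (100 sampled datasets, n_time_points) holding the
    # predicted added sugar intake at each time point for each synthetic dataset
    mean = trends.mean(axis=0)
    lower, upper = np.quantile(trends, [0.025, 0.975], axis=0)
    return mean, lower, upper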

For all performed experiments, we use the same black- and white lists and hyperparameter settings that can be found in the Supplementary Material (see Table S3).

Table 1 Overview of the experimental setup.

Results

We evaluated the generated synthetic data on four different levels. First, the individual variable distributions are presented, followed by the investigation of correlations between different variables. Moreover, use-case-specific direct dependencies are evaluated. Finally, the results of the real-world analyses are presented.

Individual variable distributions

In order to compare the distributions of the real and synthetic data, individual distributions were plotted as a first step. Examples can be found in Table 2 and Figs. 3 and 4. Overall, the summary statistics are very similar for all generated datasets and match the distributions of the real variables. For example, for the total sugar intake (\(ZUCK\_p\)), the mean of the real data amounts to 26.96, whereas the means of the synthetic data amount to 26.12, 26.1 and 26.09 for the three methods. As an example, the distribution for the third visit is visualised in Fig. 3, where slight differences between the three methods are visible and VAMBN-MT best matches the original distribution. This is also indicated by the determined JS divergences of 0.15, 0.11 and 0.09 for the three methods, respectively. Discrete variables are also mostly reconstructed correctly by VAMBN and its extensions. Three examples can be seen in Fig. 4; in Fig. 4c, a clear improvement can be seen for VAMBN-MT. The JS divergence averaged over all variables (both discrete and continuous) and across all visits amounts to \(0.15\pm 0.11\) for VAMBN, \(0.12\pm 0.08\) for VAMBN-FT and \(0.12\pm 0.09\) for VAMBN-MT.

Table 2 Summary statistics for individual variables across all visits, covering mean, standard deviation (SD), the upper and lower quartiles (i.e., 25% and 75%, respectively), and the Jensen-Shannon divergence (JS-div.).

We also determined the JS divergence for longitudinal variables from the ADNI dataset (across four visits, 56 variables in total). The average JS divergences for the original and MT versions of VAMBN are 0.175 ± 0.107 and 0.167 ± 0.115, respectively. The results can be seen in the Supplement in Table S5 and suggest that both the original VAMBN model and the VAMBN-MT extension replicate the original dataset’s statistical distributions quite well. Moreover, the MT extension at least marginally outperforms the original non-MT model for the majority (38 out of 56) of variables.

Figure 3

Distribution of the total sugar intake (\(ZUCK\_p\)) at visit three, shown for the three different algorithm architectures. The distribution of the real data is shown in blue, that of the synthetic data in orange, and their overlap in gray.

Figure 4

Distribution of different discrete variables at different visits.

Correlations between variables

We use correlations between variables as an indicator of the extent to which the properties and dependencies of the real data are represented in the synthetic data. We checked for dependencies both within the same visit and across different visits. In the heatmap representation of the Pearson correlation matrices between all variables of all visits (Fig. 5), it can be seen that absolute correlation values are generally higher in the original data (Fig. 5a). With VAMBN, we obtain significantly smaller pair-wise correlations than in the real data and get a relative error of 0.86. With the VAMBN-FT version, the error is reduced to 0.74 (see Fig. 5c) and the best result is achieved with VAMBN-MT with an error of 0.70 (see Fig. 5d). In general, also in the original data, correlations are higher between variables of the same module than between variables of different modules, which indicates a good module selection.

Figure 5

Heatmaps of the Pearson correlation matrices over all visits and variables for the real data (a) and the synthetic data produced by the three different approaches (b–d). Note that we inserted the same amount of missingness into the synthetic data to obtain comparable results. The columns are sorted first by variable group (standalone > time > anthropometric > socioeconomic > nutrition; within groups alphabetically by variable name), as indicated by the letters below (a). Within each variable, columns are sorted by visit in ascending order. Thus, each visible square in the heatmap corresponds to the correlation of two variables over their 16 visits.

Concerning the ADNI dataset, the heatmaps of the Pearson correlation matrices over all visits and variables for the real data as well as the data synthesised using the original VAMBN model and its MT extension are shown in Figure S4. The heatmaps suggest that the Pearson correlation coefficients of the variables synthesised with the MT version are slightly closer to those of the real data than those obtained with the baseline model, with relative errors of \(\epsilon =0.541\) and \(\epsilon =0.546\), respectively. Compared to the DONALD data, the correlations can generally be learned more effectively, even with the original VAMBN. This may be explained by the fact that the ADNI data contain less longitudinality (only four consecutive visits) and more variables per visit (56).

Direct dependencies

The chosen DONALD dataset contains several interesting direct dependencies, representing case-specific expert knowledge, that can be used to further judge the quality of the synthetic data by investigating whether these dependencies are correctly reproduced. We chose two of them. The first is the boolean variable m_schulab, indicating whether the participant’s mother has had 12 years of school education. In the original dataset, this variable mostly stays the same and may increase from 0 to 1 if a participant’s mother receives further education during the study; logically, however, the indicator never changes from 1 to 0. In Fig. 6, we show the value of the graduation status over all 16 visits for 100 randomly chosen samples for each of the three experiments. It becomes evident that the original VAMBN approach is not able to correctly reproduce dependencies across time, because the graduation status of the mother often changes back and forth, which is not plausible (see Fig. 6a). We see an improvement in the VAMBN-FT experiment (Fig. 6c), which makes sense because here the HI-VAE module receives all visits as one input, containing all variables involved in this dependency. Still, the result is not satisfactory. In contrast, only a few errors remain when using VAMBN-MT, as can be seen in Fig. 6d.

Figure 6

Examples of reconstructions of the boolean variable m_schulab across time. This variable indicates whether the mother of the participant has attended 12 years of school education. Whenever it was set to True (light blue) at some point, it can logically not change back to False (dark blue) at a later visit, as can be seen in the original data in (a). The subfigures (b–d) represent the results for the data produced by the three different approaches, respectively. In each case, 100 samples have been drawn randomly.

As a second direct dependency, we investigated the relationship between the variables alter and time, which describe the exact age of the child and of the study, respectively. Both are given as positive real-valued numbers and form the variable group times. Again, for all three experiments, their summary statistics seemed plausible for all visits: the age of synthetic participants was around the expected age, with some random variance, and the age of the study was commonly anywhere between 0 and 30 years – which is also plausible, since the study has always included children of any age over the course of its existence. When looking at an individual child over multiple visits, however, the age of the study kept fluctuating, often even decreasing. In the original data, by contrast, if the age has increased by \(\Delta _{alter }\), then the increase in time \(\Delta _{time }\) is exactly the same. For the virtual participants produced by the three experiments, the error \(\Delta _{time }-\Delta _{alter }\) has been calculated for any two consecutive visits, i.e., 15 times per participant. Figure 7a shows the results across all visits of all participants for all three experiments. Here again, we see the worst result for the baseline approach and the best result for VAMBN-MT: the error in passing time is clearly smaller than in the other two VAMBN versions. Hence, we conclude that our LSTM adaption significantly improves the reconstruction of direct dependencies between variables within and across time points. However, these observed improvements are limited to variables across different visits with direct mathematical relationships. Similar relationships existing within a single visit are mostly unaffected by the extension, as Fig. 7b shows.
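A sketch of how these two checks can be computed on data in the long format (one row per participant per visit) is given below; the column names pers_ID, visit, m_schulab, alter and time follow the variable names used in the text, and the implementation details are illustrative.

import pandas as pd

def check_direct_dependencies(df):
    df = df.sort_values(["pers_ID", "visit"])
    grouped = df.groupby("pers_ID")
    # 1) m_schulab may switch from 0 to 1, but must never switch back from 1 to 0
    backward_switches = grouped["m_schulab"].apply(lambda s: int((s.diff() < 0).sum()))
    # 2) Between consecutive visits, the study time must advance exactly as much
    #    as the child's age, so delta_time - delta_alter should always be 0
    deltas = grouped[["alter", "time"]].diff().dropna()
    time_age_error = deltas["time"] - deltas["alter"]
    return backward_switches.sum(), time_age_error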

Figure 7

Reproduction of direct mathematical dependencies. In (a), the error in passing time between the child’s age and the study time is shown. It is calculated between any two successive visits, where \(\Delta _{alter }\) is the time passed for the child, and \(\Delta _{time }\) is the time passed for the study. In (b), the error in reconstructing a direct mathematical relationship within a single visit is shown, i.e., the sum of proteins (EW_p), fats (Fett_p) and carbohydrates (KH_p). A perfect reconstruction of the three variables should always sum to 100.

Real-world analyses

To finally judge the usefulness of synthetic data, we investigate and compare their performance on real-world analyses. Because VAMBN and VAMBN-FT show a lower performance in reproducing correlations between variables and direct dependencies of the dataset compared to VAMBN-MT, only VAMBN-MT’s performance on real-world analyses is investigated.

In Fig. 8, the age and time trends predicted by the polynomial mixed-effects regression models for the added sugar intake can be seen. The raw trend values, including significance levels, can be found in the Supplementary Material in Table S4.

The age trend is reproduced very well by the synthetic data, and we see the same progression: whereas the added sugar intake increases up to the age of 10, it slowly decreases again afterwards. It can be seen that the variance (indicated by the error bars as confidence intervals) gets larger with increasing age. This may be due to the fact that the data basis is more incomplete for older participants. Whereas the same trend can be seen for both module selection settings (i) and (ii), selection (ii) is able to reproduce approximately the same values (i.e., in terms of amplitude). In this selection, the two dependent variables age and added sugar are learned together in one module. Nevertheless, we achieve the same trend with module setting (i), presumably because the varying age of the participant is implicitly available to the network through the number of the visit for each module.

The time trend can be approximated with the VAMBN-MT (ii) model (visualised in black in Fig. 8b). In this setting, the variables time and added sugar (\(zuzu_p\)) are learned together in one module. In contrast, with the model trained in setting (i), where the two mentioned variables are encoded by two different autoencoders, the overall descending trend cannot be reproduced, i.e., there is no change in added sugar consumption over the years.

Effects of sampling: From each trained model, an arbitrary number of datasets can be sampled. Whereas resampling the data led to no visible changes in the previously shown analyses (i.e., variable distributions, correlations between variables, and direct dependencies), an effect can be seen for the real-world analyses. Here, we realised that the sample size plays an important role for the magnitude of this effect. For \(N=10,000\) (without any post-processing), we see the same overall trend for all analyses, with only small variations in amplitude. However, for \(N=1312\), which corresponds to the number of samples in the original data, and including missingness, we sometimes even observe differences in the trend. We provide examples of good and bad results in the Supplementary Material in Figure S1.

This finding is in line with the observation that we get much higher variances in terms of confidence intervals for the smaller sample size as compared to the larger one, as can be seen in Figure S2. Additionally, the sample size has an effect on the significance level. For the sample size of 10,000, all of the 100 sampled datasets show statistically significant results for both age and time trends – i.e. they have p-values below 0.05 for all terms (linear, quadratic and cubic). In contrast, for the sample size of 1312, only 70.71% and 22.22% of the samples produce statistically significant age and time trends, respectively.

Concerning the computational cost of drawing larger datasets, the generation time increases only marginally, so it would easily be possible to draw much larger datasets of, e.g., 1 million samples. However, running the polynomial regression model to determine the time and age trends not only takes significantly longer, but we also noticed that the larger the dataset becomes, the more likely this analysis is to fail. Successfully running 100 polynomial regressions with such a large sample size would therefore require generating many more datasets. This is one reason why we consider 10,000 samples to be a realistic upper limit. Secondly, 10,000 samples are already about 8 times the original dataset size, and since our goal is the release of realistic data, generating, e.g., 1 million rows would be unrealistic in that regard.

Effects of real data quality: The significance levels found in the age and time trends of the real data influence the ability to reproduce the results with synthetic data. In addition to the added sugar, Perrar et al. also investigated age and time trends for the total sugar consumption. Whereas the age trends are all statistically significant (i.e., \(p<0.05\)), this is not the case for the time trend, where both linear and cubic terms have p-values of 0.8951 and 0.1620, respectively. This is reflected in our analyses, as we are able to reproduce the age, but not the time trend. This is summarised in Figure S3 (Supplementary Material).

Figure 8

Comparison of age and time trends for added sugar intake predicted by polynomial mixed-effects regression models. The age and time trends of added sugar intake are shown for the real data (red) and the synthetic data (blue: setting (i); black: setting (ii)) in subfigures (a) and (b), respectively. Whereas setting (i) corresponds to the original module selection proposed by the experts, in setting (ii) the dependent variables time, age and added sugar are in one module and hence learned by the same autoencoder model. The error bars for the synthetic data indicate the confidence intervals calculated across 100 independently sampled datasets. For all synthetic datasets, the sample size is \(N=10,000\).

Discussion

Due to legitimate privacy concerns, access to personal health data is strictly regulated3, even for scientific purposes. However, the availability of data is fundamentally necessary for scientific progress. For deep learning methods to be applied successfully, it is also crucial that the available training datasets are sufficiently large1. To satisfy both privacy and usability demands, a classic approach is to anonymise datasets, which reduces disclosure risks and may allow sharing of the data even without informed consent. But anonymisation can severely limit the utility of medium- to high-dimensional data for statistical analyses4, and there are still recent results claiming to be able to re-identify individuals from anonymised data35,36. For these reasons, fully synthetic data generation methods are explored as an alternative to anonymisation. The idea is that generating and sharing synthetic data may offer a better compromise between the privacy of the individuals represented in the original dataset and the usability of the shared data37.

A lot of research has already been performed in the area of synthetic data generation. Common applications are based on image, text or tabular data generation from various domains (e.g.,14,15,16,17). However, the heterogeneity, the incompleteness, and the longitudinality of the DONALD dataset complicate the construction of generative models that produce useful synthetic data, making the dataset particularly interesting for the development of new methods and their evaluation. Due to these challenging properties, the majority of published methods are not suitable for the task at hand. Therefore, we chose VAMBN21 as a baseline method, which is indeed designed to handle longitudinal, heterogeneous, and incomplete datasets. It does so by combining HI-VAE30, a generative model that can encode and reconstruct heterogeneous incomplete data, with a Bayesian Network.

In our study, we showed that VAMBN is able to reproduce summary statistics and individual variable distributions effectively. However, pair-wise correlations between different variables in the synthetic data are often not captured well (see Fig. 5). Moreover, VAMBN fails to learn use case specific direct dependencies over time, such as the graduation status of the mother of the participants or the linear correlation between the participant’s age and the time point of the study (see Figs. 6 and 7a).

Therefore, we proposed an extension of VAMBN, called VAMBN – Memorised Time points (MT), that incorporates an LSTM network, i.e., a network designed to have a long short-term memory and thus to better cope with time dependencies within the data. In our extension, all visits of the same variable group are modelled within one module instead of being split into different modules that are connected via the BN. Hence, this extension firstly requires each HI-VAE module to map the data of all visits of its variable group to a single encoding, a process assisted by the newly introduced LSTM layer. Secondly, the extension must be able to reconstruct all data points of all visits at once from this single encoding, which is done in a manner that ensures that all dependencies can be learned. In order to judge the improvement gained by including the LSTM layer, we compared both the baseline and VAMBN-MT to VAMBN – Flattened Time points (FT), which also reframes the visits so that they are learned together, but only uses the default feedforward network in the HI-VAE encoder (see Fig. 2).

In terms of pair-wise correlations between variables, VAMBN-MT clearly outperforms the two other approaches (Fig. 5). The direct dependencies across visits (i.e., the education status of the mother and the linear relationship between time and age) are also greatly improved with VAMBN-MT. An essential part of the quality analysis of synthetic data is to test whether they can be used for real-world analyses. To this end, we use VAMBN-MT to reproduce the time and age trends for added sugar intake, as done by Perrar et al.12. Here, we could successfully reproduce the age trend and approximate the time trend (see Fig. 8). The reproduction of the time trend seems to be more difficult. We theorise that the difficulty of learning a variable trend varies depending on how many variables are involved. If a point in the trend is based on exactly one variable (e.g., added sugar intake at 10 years of age in the age trend always corresponds to zuzu_p_VIS07 in the data), learning the trend correctly is comparatively less complex. But if a point in the trend depends on other variables as well (e.g., added sugar intake in the year 2000 in the time trend depends on zuzu_p_VISXX, where the visit XX depends on when the child was born, which can be inferred from time), correctly learning the trend seems to be much harder. Because we realised that the real-world analyses can vary between different datasets sampled from the same model (which is not the case for the previous analyses), we systematically determined the effect of sampling by drawing new samples from each model 100 times and by varying the sample size, generating datasets of size 1312 (the size of the original data) and 10,000 (with and without post-processing, respectively).

Thus, we draw the following conclusions: Firstly, a larger dataset leads to more stable age and time trends that only differ in amplitude but not in the progression of the trend itself (compare Fig. S1). Larger datasets also show less variance (in terms of confidence intervals) (Fig. S2). Moreover, whereas analyses with a small dataset lead to statistically significant results in only 22%-70% of cases, we get 100% significance for the large datasets. In addition, we could show that the selection of the modules needs to be made in dependence on the research question, because correlations are lost if variables are trained with separate autoencoders. Hence, as a practical suggestion, we propose to group related variables together, in particular those with known correlations between them; if researchers want to investigate the association between specific variables, those should be placed in one module. If the research question is not known in advance and there is little knowledge about the relationships between the variables, they can be grouped automatically using a hierarchical clustering approach; we provide the code in our repository. Finally, we investigated the total sugar intake, which does not show statistically significant results in the real data for the linear and quadratic time trends, and conclude that a non-significant trend cannot be reproduced with the synthetic data (see Figure S3).

The present study was designed together with experts from the nutritional domain in order to investigate the quality of synthetic data. We therefore focused on a specific dataset with specific properties, such as longitudinality and direct dependencies. As we identified some shortcomings of the baseline algorithm VAMBN, we changed the algorithm’s architecture by making use of an LSTM layer (VAMBN-MT) and achieved an improvement of the synthetic data quality in terms of individual variable distributions, correlations, direct dependencies and real-world analyses. In order to test this on a second use case, we chose a further longitudinal dataset and compared VAMBN and VAMBN-MT based on individual variable distributions and correlations between variables. Here, we could show slight improvements with our developed algorithm and thus demonstrate the effectiveness of VAMBN-MT. However, it becomes clear that synthetic data generation is very use-case specific and that the correlations between the variables of the DONALD dataset, for example, are generally harder to learn than those of ADNI. A major benefit of the developed algorithm becomes apparent in the use-case-specific experiments, such as those on the direct dependencies of the DONALD data. Therefore, for future research, we plan to test the algorithm on further longitudinal datasets; however, getting access to them and configuring the algorithms is unfortunately not straightforward.

Altogether, we showed the importance of use-case-specific evaluations incorporating expert knowledge, especially in terms of direct dependencies. The real-world analyses demonstrated the usefulness of synthetic data and provided valuable insights into the effects of resampling and the chosen sample size, the selection of variables that need to be learned together, and the importance of the significance level of the real-world experiments.

Conclusion and outlook

In this study, we generated synthetic data for a longitudinal study from the nutritional domain and performed an in-depth quality analysis, going beyond summary statistics and individual variable distributions. As we found that current state-of-the-art methods are limited in reproducing use-case-specific direct dependencies across time points, we developed VAMBN-MT, an LSTM-based extension of VAMBN that outperforms the original approach and is even able to reproduce real-world analyses. We highlighted the drastic increase in synthetic data quality achieved by incorporating expert domain knowledge and choosing a sufficiently large sample size, and we showed that the resulting synthetic data can be a valuable source for real-world analyses. As a next step in our research, we plan to investigate the privacy risks associated with the data generated by our model to gain further insights into the risk-utility trade-off and to ultimately unlock the potential benefits of using synthetic data.