1 Introduction

Longitudinal datasets are prevalent in various fields such as health and education. These datasets comprise two types of feature: static (or time-invariant) features recorded at the study’s outset and dynamic (or time-variant) features resembling multivariate time series. However, the dynamic component of longitudinal data often encounters missing values due to factors such as irregular sampling and study dropouts, which adversely affects downstream tasks, including classification [1]. Our study focuses on exploring longitudinal data classification in the context of this prevalent missingness problem.

The inherent missing data problem makes it difficult for generative methods to effectively learn the underlying data distribution [2], and causes suboptimal performance and biased outcomes for discriminative methods [3]. Over the years, various statistical and machine learning methods have been developed for longitudinal classification under such conditions [4, 5]. However, a majority of these methods overlook the importance of (1) assessing how static variables interact with temporal observations at various levels of abstraction, influencing classification outcomes, (2) extracting and using relevant static features at each time step to enhance temporal learning, and (3) evaluating effective strategies to integrate the longitudinal modalities (static and time series) to achieve enhanced prediction outcomes.

In recent years, deep-generative models, such as GANs, have emerged as invaluable tools to generate high-quality samples, beneficial for imputation and classification tasks [6,7,8]. However, many GAN-based methods tend to disregard the static features of the population present in the dataset, which remain consistent throughout the temporal data collection phase. These static covariates can influence temporal observations [1], underscoring the importance of incorporating such dependencies in classification tasks [9].

We conjecture that integrating static covariates into a GAN framework can enhance temporal learning, particularly when addressing inherent missing data within the time series component. We rationalise that static features observed at the study’s outset provide insight into the underlying population distribution from which longitudinal observations are drawn. Therefore, considering their impact on temporal observations can help to improve imputation and classification objectives. By elucidating effective approaches for jointly modelling these modalities (static and time series), researchers can better determine suitable fusion strategies, a crucial aspect for longitudinal datasets with substantial levels of missing values.

This paper proposes the fusion-aided imputer-classifier GAN (FaIC-GAN), a conditional GAN implemented with various data fusion approaches and designed to make the best use of the static modality to enhance temporal data learning in the presence of missing data and subsequently improve classification. We introduce four fusion strategies employing early, joint, post- and attention-based fusion in FaIC-GAN. Extensive experiments are conducted to assess the performance of FaIC-GAN under different fusion strategies. Empirical analysis reveals that our post-additive fusion strategy yields the highest overall classification performance, closely followed by our attention fusion strategy. Together, these two multimodal FaIC-GAN models consistently rank among the top three performers across all datasets, even those with high missing data rates and small sample sizes.

To the best of our knowledge, and as confirmed by recent studies [3, 10], this paper represents the first effort to explore the learning of joint representation of static and temporal modalities to enhance a classification objective, specifically within a GAN framework. This paper makes the following specific contributions:

  • We propose an end-to-end conditional GAN model, FaIC-GAN, that leverages partially observed temporal data and static covariates to improve a classification objective.

  • We introduce four multimodal fusion strategies and investigate their differential effects on integrating static and temporal modalities within an imputer-classifier GAN for longitudinal data.

  • We hypothesise, based on our findings, that fusion strategies should be chosen according to the levels of abstraction at which the static and temporal modalities best interact within the dataset.

  • We show through experimental analysis that our post-additive and attention-based fusion strategies perform better than other fusion strategies employed in FaIC-GAN and unimodal models.

2 Related Work

In this paper, we focus on addressing the missing data problem in longitudinal data, joint representation learning in longitudinal multimodal data and exploring GAN-based methods for classifying longitudinal data. We position our work at the intersection of these three areas and provide an overview of related studies.

2.1 Missing Data Problem in Longitudinal Data

Addressing missing values in longitudinal data remains a persistent challenge today [11], particularly due to their adverse effects on downstream tasks such as classification [2]. Given the unique temporal patterns within each time series data, the selection of imputation techniques is often specific to the dataset and/or tasks [12]. The literature shows that (1) no single imputation technique is suitable for all longitudinal datasets [12], (2) the success of a classification task attributed to an imputation technique is closely tied to the specific dataset and (3) neglecting the instance heterogeneity stemming from static features can lead to pseudo-replication, a problem that inflates model performance and affects generalisation [13]. This calls for more objective approaches to handling the missing data problem in longitudinal datasets, ones that are less dataset-centric, and to consider the influence of static features on temporal observations.

2.2 Joint Representation Learning for Longitudinal Data

When dealing with longitudinal data, it is common to employ probabilistic models like generalised mixture models [4] and latent variable models [5]. These models ensure that random effects from static data are considered in the analysis [14]. However, these methods often depend on heavy feature engineering and are computationally expensive. Other methods include transforming either the data or the problem itself. For example, some methods concatenate static data with summaries of temporal features [9, 15], effectively reframing the problem into a non-longitudinal context. Alternatively, static features are repeated at each time step [16], effectively recasting the problem as a purely time series problem with quasi-dynamic data [3].

Recent advancements have led to the consideration of longitudinal data components as distinct modalities for multimodal representation learning [17, 18]. Research has shown that fusion strategies—combining features from multiple modalities into a joint representation—on longitudinal data components can improve various tasks such as sample generation [19], classification or prediction [20,21,22] and risk analysis [23]. Common fusion approaches applied to longitudinal (static and temporal) modalities include (1) early fusion strategies, where static features or their embeddings are concatenated before or after being processed through temporally sensitive networks for temporal learning [19, 22, 24], (2) joint fusion strategies, where latent features from separate modalities are combined just before a prediction or classification task [25, 26], and (3) other strategies, such as using static embeddings to initialise temporal learning in recurrent neural networks (RNNs) [20, 27].

A recent survey [3] outlines three commonly used methods to combine static features with temporal observations: (1) through a fully connected neural network (FC) updated separately, (2) by concatenating static data at each time point and (3) through distinct FC layers within the network. However, the survey identifies a lack of existing works comparing the varied effects of fusion strategies on prediction or classification tasks in longitudinal data. Another recent study [10], exploring feature interactions in multimodal models, stresses the risk of exposure bias when static influences are ignored in longitudinal classification. Upon reviewing existing literature, we identified a gap in comparing the effects of fusion strategies on static and temporal components of longitudinal datasets, particularly within a GAN framework. Therefore, this study investigates four distinct fusion strategies to improve the imputation and classification tasks for the conditional GAN model (FaIC-GAN), adopting the third approach described in the survey [3]. We compare the performance of our multimodal models against four unimodal models.

2.3 GAN-Based Imputation and Classification of Longitudinal Data

The traditional GAN [28] is a composite model that consists of a generator model G and a discriminator model D. Operating in tandem, G attempts to generate novel samples akin to real samples, while D is trained to identify fake samples generated by G. At each iteration, D provides feedback to G, allowing G to adjust its weights to produce better samples in an attempt to deceive D again. This cycle continues until the GAN model converges. Upon convergence, G learns a distribution \(p_g\) that aligns with the true distribution \(p_r\), as discerned by D.
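
For reference, the adversarial game described above is commonly written as the following minimax objective from the original GAN formulation [28]. The symbol \(p_z\) for the noise prior is notation introduced here for illustration and is not used elsewhere in this paper.

$$\begin{aligned} \min _{G} \max _{D} \; {\mathbb {E}}_{x \sim p_r}[\log D(x)] + {\mathbb {E}}_{z \sim p_z}[\log (1 - D(G(z)))] \end{aligned}$$

D maximises the objective by assigning high probability to real samples and low probability to generated ones, while G minimises it by producing samples that D scores as real.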

GAN-based imputation methods for longitudinal data often use RNNs and professor forcing [29] to predict missing values at subsequent time steps [6, 30]. Other methods utilise temporal decay mechanisms [7, 31], missingness masks and hint vectors [32,33,34] to aid their respective G models in learning the true distribution for imputation. Research indicates that feeding clues regarding missing data patterns to D can assist G in approximating the real distribution [35], thus constraining the solution space for G’s estimation of missing values.

While GAN-based imputation models have been implemented in conjunction with classification objectives, these applications have often focused on stabilising GAN training [34, 36] or enabling semi-supervised learning [37]. Importantly, many of the GAN-based imputation methods have not yet harnessed the static component to address the missing data problem within the temporal component or specifically enhance classification objectives. The literature shows that GAN-based imputation methods often tackle missing data arising from irregular sampling [38, 39]. However, this kind of missingness predominantly affects the time series component of longitudinal data, often leading to the oversight of static features under the assumption that they remain constant over time. This approach, however, violates the non-independence assumption of the time series component of longitudinal datasets [40].

Regarding imputation and classification, employing separate two-step processes to enhance imputation for subsequent classification can risk suboptimal performance [41]. In contrast, end-to-end approaches allow effective exploration of missing patterns for both tasks [2]. Within the GAN framework, the incorporation of an auxiliary classifier objective has been shown to significantly enhance the imputation goal [42, 43]. Shared model approaches further amplify this improvement, as each task benefits from the learning of the other [34, 36]. In this paper, we employ an end-to-end conditional GAN (FaIC-GAN) to primarily enhance an auxiliary classification objective while concurrently optimising an imputation objective. To achieve this, we introduce temporal masks and indicators to improve missing value estimates via a multiple imputation strategy and learn the true distribution. Our training strategy assumes data to be missing completely at random [44].

Unlike existing methods, we propose conditioning the GAN on partially observed temporal data, seamlessly fused with static information, for informed time series imputation and classification. Specifically, FaIC-GAN enables the integration of four fusion strategies within a conditional imputation GAN that optimises an auxiliary classifier objective with the aid of static modality.

3 Fusion-Aided Imputer-Classifier GAN (FaIC-GAN) for Longitudinal Data

The proposed method, FaIC-GAN, is designed to seamlessly integrate learned features of the static modality with the temporal modality. Employing a joint representation learning strategy, FaIC-GAN improves its ability to estimate the true data distribution. Through repeated sampling from the training set, it performs multiple imputations that allow FaIC-GAN to capture and minimise uncertainties associated with its imputed estimates.

3.1 Problem Statement

Let each instance i in a longitudinal dataset containing N instances be represented as \({\mathcal {X}}_i=\{S_i, X_{i_{1:T}}, y_i\}\). To simplify the notation, we will drop the subscript i. Let \({\mathcal {X}}\) encompass a total of U static variables and V time series variables. The static features are denoted as \(S=\{s^1,\ldots ,s^U\}\), while the multivariate time series component is represented as \(X=\{x_1^1,x_1^2,\ldots ,x_T^V\}\). At each time step t, a total of V time series measurements are taken over a fixed period of time T. Each instance is assigned a class label \(y\in \{1,\ldots ,k\}\) from a set of k classes. A temporal mask M is used to indicate the presence or absence of each element \(x_t^v\) in X, such that \(M \in \{0,1\}^{T \times V}\).

The inputs and outputs of the proposed FaIC-GAN model are as follows. For initial estimates of missing values in X, we sample the noise \(z \sim N(0,1) \in Z\). Let the input and output of Generator G be \(\bar{X}\) and \(X_G\), respectively. Let the imputed data \({\hat{X}}\) be the input of the discriminator D. These input and output values are calculated as follows.

$$\begin{aligned} \bar{X}&= M \odot X + (1-M)\odot Z \end{aligned}$$
(1)
$$\begin{aligned} X_G&= G(\bar{X},S,M) \end{aligned}$$
(2)
$$\begin{aligned} {\hat{X}}&= M \odot X + (1-M)\odot X_G \end{aligned}$$
(3)
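
As an illustration of Eqs. (1) to (3), the following NumPy sketch builds \(\bar{X}\) and \({\hat{X}}\) for a single instance. The function `toy_generator` is a hypothetical stand-in for G (the actual generator is a trained network), and the array sizes and missingness rate are arbitrary assumptions chosen only to make the example run.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, U = 4, 3, 2                               # time steps, temporal variables, static variables

X = rng.normal(size=(T, V))                     # time series (entries at missing positions are ignored)
M = (rng.random((T, V)) > 0.3).astype(float)    # temporal mask: 1 = observed, 0 = missing
S = rng.normal(size=U)                          # static features
Z = rng.normal(size=(T, V))                     # noise z ~ N(0, 1)

def toy_generator(x_bar, s, m):
    """Hypothetical stand-in for G; returns an array shaped like x_bar."""
    return np.tanh(x_bar + s.mean())

X_bar = M * X + (1 - M) * Z                     # Eq. (1): noise fills the missing entries
X_G = toy_generator(X_bar, S, M)                # Eq. (2): generator output
X_hat = M * X + (1 - M) * X_G                   # Eq. (3): observed values kept, gaps imputed
```

Only the missing positions of \({\hat{X}}\) come from G; every observed value passes through unchanged.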

The \({\hat{X}}\) includes the generated estimates by G for the missing values, which are optimised by the FaIC-GAN model through learning of the temporal mask by D. Thus, one of the objectives of FaIC-GAN’s D is to estimate the mask:

$$\begin{aligned} {\hat{M}}=\arg \max _{M} P(M|{\hat{X}},S,I,\theta_D) \end{aligned}$$
(4)

where each \({\hat{m}}^v_t\) in \({\hat{M}}\) indicates D’s prediction of whether the element \({\hat{x}}^v_t\) in \({\hat{X}}\) is real (observed) or fake (estimated). This approach allows for the improvement of missing data estimates in a focused manner. Because D produces a real/fake prediction for each element of the time series, G receives element-level feedback on the quality of its estimates, leading to an improvement in imputation.

To obtain accurate estimates of the model parameters of D (\(\theta_D\)) and ultimately of G (\(\theta_G\)), we use temporal indicators I, obtained as:

$$\begin{aligned} I = M \odot \varrho + 0.5(1-\varrho ) \end{aligned}$$
(5)

At each training step, a random variable \(\varrho\) is drawn from the discrete set \(\{0,1\}^{T \times V}\); it determines how many clues D receives about which elements in \({\hat{X}}\) were imputed. This partial information in I, fed into D along with \({\hat{X}}\) and the static features S, means that D can return specific feedback to G regarding the reliability of its estimates. With D adversarially tagging G’s estimations (\((1-M)\odot X_G\)) as fakes, G adjusts its weights to learn the true data distribution, thereby refining its estimates.
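
A minimal sketch of Eq. (5) follows. It assumes that each element of \(\varrho\) is drawn independently as a Bernoulli variable; the parameter `hint_rate` (the probability that an element of the mask is revealed to D) is a hypothetical name introduced here for illustration.

```python
import numpy as np

def temporal_indicator(mask, hint_rate, rng):
    """Eq. (5): I = M * rho + 0.5 * (1 - rho), with rho in {0, 1}^(T x V).

    hint_rate is an assumed Bernoulli probability, not a value given in the text.
    """
    rho = (rng.random(mask.shape) < hint_rate).astype(float)
    return mask * rho + 0.5 * (1.0 - rho)

rng = np.random.default_rng(0)
M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
I = temporal_indicator(M, hint_rate=0.8, rng=rng)
```

Where \(\varrho = 1\), the indicator reveals the true mask value (0 or 1); where \(\varrho = 0\), it is set to 0.5, which tells D nothing about whether that element was observed or imputed.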

Within our end-to-end model, D has two objectives, (1) discriminate between real and fake elements of \({\hat{X}}\) (see Eq. (4)) and (2) determine class labels y as follows:

$$\begin{aligned} {\hat{y}}=\arg \max _{y}P(y|{\hat{X}},S,I,\theta_D) \end{aligned}$$
(6)

While classification is often regarded as an auxiliary task to imputation, we treat it as our ultimate goal and seek to improve it.

For joint representation learning, we examine various ways of combining the longitudinal data components S and X through transformation functions \({\mathcal {F}}(*)\), as shown in Fig. 2, where \(*\) indicates various combinations of the modalities or their embeddings. More detail on fusion strategies is given in Sect. 3.4.

3.2 Overview of FaIC-GAN

Using the notations defined in Sect. 3.1, Fig. 1 describes the architecture of the FaIC-GAN model. The core task of time series imputation is enhanced through the effective fusion of static influences with temporal observations as implemented by the fusion strategies illustrated in Fig. 2. During training, the generated sample space \(p_g\) is conditioned on partially observed data \(\bar{X}\), static embeddings \({\hat{S}}\) and the missingness context inferred from M.

Fig. 1 The forward propagation process in the FaIC-GAN model involves the joint representation learning of temporal and static features, aimed at improving time series imputation and classification. To illustrate the temporal aspect, the [1 : T] subscript is appended. Fusion strategies, as shown in Fig. 2, are applied within both the Generator G and Discriminator D. Here, the mask M indicates the data that is absent due to the random masking operation performed at each training iteration, in conjunction with the pre-existing missing data.

D is fitted with its dual objectives, as given in Eqs. (4) and (6). Joint training and optimisation of an auxiliary objective help alleviate mode collapse [45]. This phenomenon arises when G becomes fixated on generating samples from a single class (or mode) or a limited subset of classes, consequently causing D to get stuck in local minima. As G undergoes numerous training iterations and performs multiple imputations, it progressively refines its estimate of \(p_r\), ultimately generating more plausible estimates for the existing missing data. Inadvertently, through I, D contributes to the achievement of G’s objective by focusing on the imputed elements. Within the minimax training regime that focuses on imputation, the D in FaIC-GAN enhances its auxiliary classification objective, which we consider our main focus, with imputation serving as the means to improve the classification task.

3.3 Losses for FaIC-GAN

To optimise FaIC-GAN for longitudinal data, G aims to bolster the imputation objective (Eq. 4), while D constantly attempts to thwart G’s efforts and improve a classification objective (Eq. 6).

Centred solely on imputation, G tries to minimise the linear combination of two losses: (1) adversarial loss \({\mathcal {L}}_\textrm{adv}\) wherein it attempts to deceive D into thinking its estimates are real and (2) reconstruction loss \({\mathcal {L}}_\textrm{rec}\) wherein it seeks to reconstruct the observed data, thereby enhancing its estimates of the missing values. G’s losses are formulated as:

$$\begin{aligned} {\mathcal {L}}_\textrm{adv}&= -(1-M) \log D({\hat{X}},S,I) \end{aligned}$$
(7)
$$\begin{aligned} {\mathcal {L}}_\textrm{rec}&= \Vert M \odot {X} - M \odot X_G\Vert _2 \end{aligned}$$
(8)
$$\begin{aligned} {\mathcal {L}}_G&= {\mathcal {L}}_\textrm{adv} + \alpha {\mathcal {L}}_\textrm{rec} \end{aligned}$$
(9)

where \(\alpha\) is a balancing hyper-parameter responsible for controlling the extent to which the task of reconstructing observed values contributes to improving G’s capability to generate viable samples for imputation.
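
The generator losses in Eqs. (7) to (9) can be sketched as below for a single instance. This is an illustration under stated assumptions rather than the implementation used in our experiments: `d_prob` is a hypothetical array holding D’s per-element probability that each entry of \({\hat{X}}\) is observed, the adversarial term is averaged over all elements (Eq. (7) does not fix a reduction), and `eps` is a small constant added for numerical stability.

```python
import numpy as np

def generator_loss(X, X_G, M, d_prob, alpha, eps=1e-8):
    """Return (L_G, L_adv, L_rec) for one instance."""
    # Eq. (7): only imputed positions (1 - M) contribute; mean reduction is an assumption.
    l_adv = -np.mean((1 - M) * np.log(d_prob + eps))
    # Eq. (8): L2 (Frobenius) norm of the reconstruction error on observed positions.
    l_rec = np.linalg.norm(M * X - M * X_G)
    # Eq. (9): linear combination weighted by alpha.
    return l_adv + alpha * l_rec, l_adv, l_rec

X = np.ones((2, 3))
M = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
fooled, _, _ = generator_loss(X, X.copy(), M, np.full((2, 3), 0.9), alpha=10.0)
caught, _, _ = generator_loss(X, X.copy(), M, np.full((2, 3), 0.1), alpha=10.0)
```

With a perfect reconstruction of the observed values, the loss is driven only by the adversarial term, which shrinks as D assigns higher "observed" probabilities to the imputed elements.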

D functions as a shared multitask model featuring two output heads: \(D_S\) and \(D_C\). \(D_S\) focuses on Source discrimination, entailing D’s assessment of the source of each element in \({\hat{X}}\) as generated by G or observed. This is analogous to estimating the mask \({\hat{M}}\). On the other hand, \(D_C\) serves as the classification head that outputs Class labels \({\hat{y}}\). The discriminative loss \({\mathcal {L}}_\textrm{dis}\) managed by \(D_S\) and the classification or cross-entropy loss \({\mathcal {L}}_\textrm{cls}\) managed by \(D_C\) are calculated as:

$$\begin{aligned} {\mathcal {L}}_\textrm{dis}&= - (M \log D_S({\hat{X}},S,I) + (1 - M) \log (1 - D_S({\hat{X}},S,I))) \end{aligned}$$
(10)
$$\begin{aligned} {\mathcal {L}}_\textrm{cls}&= -\log D_C(y \vert {\hat{X}},S,I) \end{aligned}$$
(11)
$$\begin{aligned} {\mathcal {L}}_D&= \delta {\mathcal {L}}_\textrm{dis} + (1-\delta ) {\mathcal {L}}_\textrm{cls} \end{aligned}$$
(12)

where \(\delta\) is a hyper-parameter that balances the discriminative and classification objectives. D attempts to minimise its overall loss \({\mathcal {L}}_D\). During experiments, \(\delta\) was fine-tuned to ensure that if one task held a greater contribution, the other task’s contribution would be proportionately reduced.
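
As a hedged sketch of Eqs. (10) to (12), the function below computes the discriminator losses for one instance. It reads \({\mathcal {L}}_\textrm{cls}\) as the cross-entropy described in the text, takes `ds_prob` (per-element output of \(D_S\)) and `class_probs` (output of \(D_C\)) as hypothetical inputs, averages the discriminative term over elements, and indexes classes from 0 rather than 1.

```python
import numpy as np

def discriminator_loss(M, ds_prob, class_probs, y, delta, eps=1e-8):
    """Return (L_D, L_dis, L_cls) for one instance."""
    # Eq. (10): binary cross-entropy between the mask M and D_S's predictions.
    l_dis = -np.mean(M * np.log(ds_prob + eps)
                     + (1 - M) * np.log(1 - ds_prob + eps))
    # Eq. (11), read as cross-entropy: negative log-probability of the true class.
    l_cls = -np.log(class_probs[y] + eps)
    # Eq. (12): delta trades off source discrimination against classification.
    return delta * l_dis + (1 - delta) * l_cls, l_dis, l_cls

M = np.array([[1.0, 0.0],
              [0.0, 1.0]])
ds_prob = np.full((2, 2), 0.5)           # an uninformed D_S
class_probs = np.array([0.7, 0.2, 0.1])  # D_C output over k = 3 classes
L_D, L_dis, L_cls = discriminator_loss(M, ds_prob, class_probs, y=0, delta=0.3)
```

Because the two weights sum to one, raising \(\delta\) increases the contribution of source discrimination and reduces that of classification by the same amount.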

3.4 Fusion Strategies for Improved Longitudinal Imputation

In this section, we discuss the specifics of each fusion strategy implemented in D and G, as shown in Fig. 2. The fusion implementations in both D and G are nearly identical, except that D takes \({\hat{X}}\) (Eq. 3) and I (temporal indicator, Eq. 5) as inputs, whereas G takes \(\bar{X}\) (Eq. 1) and M (temporal mask) as inputs, as shown in Fig. 1. All unimodal models, used for benchmarking, utilise the temporal modality only as input.

To enrich the information content of the static modality, FaIC-GAN first derives embeddings of the static features, henceforth referred to as static embeddings, for all fusion strategies. This choice rests on the premise that static embeddings potentially encapsulate richer information, particularly when dealing with a large number of static features, as exemplified by the ESL dataset presented in Table 1. Additionally, we adopt this approach to ensure an equitable contribution of static influences to temporal features in fusion strategies such as the joint fusion described in Sect. 3.4.2. Subsequently, we combine these embeddings with temporal features through various means to facilitate temporal learning.

Fig. 2 Fusion strategies introduced in FaIC-GAN for combining static and temporal components of longitudinal datasets, as implemented in Generator G. Transforming functions \({\mathcal {F}}(*)\) are given with each strategy illustration. The final network layers \({\mathcal {F}}_\textrm{out}\) are either fully connected network (FC) layers or recurrent neural network (RNN) layers that output the generated sample \(X_G\). The \(\bigcirc\), \(\otimes\) and \(\oplus\) represent concatenation, element-wise multiplication and element-wise addition, respectively. Implementation of these fusion strategies in D mirrors that in G, although they vary in their inputs and outputs.

3.4.1 Early Fusion

Early fusion is a commonly used fusion strategy [22, 23] that involves concatenating the features of both modalities at a lower abstraction level, before feeding the merged features into the network for subsequent classification. Typically, only a few static features, deemed relevant to the task by domain experts or manual engineering, are fed into the RNN layers alongside the time series. This approach operates under the assumption that cross-modal relationships exist at feature level or lower levels of abstraction, thereby ignoring learning of marginal representations [46]. Consequently, there exists a potential risk of overlooking correlated information present at higher levels of abstraction [47].

Here, we concatenate the static embeddings with the temporal observations and their associated missingness information to facilitate temporal learning. The early fusion approach employed in FaIC-GAN can be summarised as:

$$\begin{aligned} {\hat{S}}&= \text {FC}(S) \\ X_{f}&= \text {Concat}(\bar{X},{\hat{S}},M) \end{aligned}$$

where \({\hat{S}} \in {\mathbb {R}}^{T\times V}\) represents the static embeddings, T represents the number of time steps and V represents the number of time series variables. The static modality is embedded into the same dimensional space as the time series modality, promoting equal contribution to the generative and discriminative processes. The joint feature space, denoted as \(X_{f}\), combines both modalities along with the missingness information via concatenation. Subsequently, this composite input is passed through the RNN layers to generate \(X_{G}\).
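
The early fusion step can be sketched as follows. The FC layer is represented by a single randomly initialised linear map (`W`, `b`), and concatenation along the feature axis is an assumption, since the axis is not specified above; the resulting \(X_f\) would then be passed to the RNN layers.

```python
import numpy as np

rng = np.random.default_rng(0)
T, V, U = 5, 3, 4                               # time steps, temporal variables, static variables

# Hypothetical FC parameters mapping U static features to a T x V embedding.
W = rng.normal(scale=0.1, size=(U, T * V))
b = np.zeros(T * V)

def static_embedding(S):
    """S_hat = FC(S), reshaped into the same T x V space as the time series."""
    return (S @ W + b).reshape(T, V)

X_bar = rng.normal(size=(T, V))                 # partially observed input from Eq. (1)
M = (rng.random((T, V)) > 0.3).astype(float)    # temporal mask
S = rng.normal(size=U)                          # static features

S_hat = static_embedding(S)
X_f = np.concatenate([X_bar, S_hat, M], axis=-1)   # joint feature space, shape (T, 3V)
```

Each time step of \(X_f\) therefore carries the temporal observations, the static embedding and the missingness information side by side.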

Early fusion offers the advantage of early-stage cross-modal interaction learning among different modalities. However, a drawback is its potential to inadequately preserve the intra-modal features as multimodal features are combined at an early stage. Consequently, the distribution characteristics of each modality could be compromised as they progress through subsequent RNN layers. In cases where the static modality has limited relevance to the temporal learning task or desired classification outcomes, the early fusion model may perform satisfactorily.

A variant of this approach involves deriving temporal embeddings and subsequently combining them with static features [48]. However, given that the primary objective of this paper is to investigate the influence of static modality on temporal learning, we abstain from implementing this logic in FaIC-GAN.

3.4.2 Joint (or Intermediate) Fusion

Joint fusion involves combining feature embeddings of one modality with those of another modality within intermediate layers of the network [49]. Consequently, network weights associated with both modalities are updated during the training process [50].

In our FaIC-GAN implementations, we employ two variations for the integration of static embeddings with temporal observations: additive and multiplicative. Adding static influence into temporal information yields a shift in the centre (mean) of the distribution of network weights, whereas multiplying static influence with temporal information leads to an increase in network variance. The joint fusion strategy in FaIC-GAN can be represented as:

$$\begin{aligned} X_{f} = \text {RNN}((\bar{X},M)\circ {\hat{S}}) \end{aligned}$$
(13)

where \({\hat{S}} \in {\mathbb {R}}^{T\times V}\) represents the static embeddings, T represents the number of time steps, V represents the number of time series variables and \(\circ\) represents an element-wise addition or multiplication operation.
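
One possible reading of Eq. (13) is sketched below. It assumes that \({\hat{S}}\) is combined element-wise with \(\bar{X}\) and that the mask M is then appended along the feature axis; the plain tanh recurrence `simple_rnn` and its weights are hypothetical stand-ins for the RNN layers.

```python
import numpy as np

def joint_fuse(x_bar, m, s_hat, mode="add"):
    """Combine static embeddings with the temporal input, then append the mask."""
    if mode == "add":
        fused = x_bar + s_hat       # additive variant
    elif mode == "mul":
        fused = x_bar * s_hat       # multiplicative variant
    else:
        raise ValueError("mode must be 'add' or 'mul'")
    return np.concatenate([fused, m], axis=-1)

def simple_rnn(inputs, W_in, W_h):
    """Plain tanh recurrence; returns the hidden state at every time step."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_in + h @ W_h)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, V, H = 5, 3, 8                                # H: assumed hidden size
x_bar = rng.normal(size=(T, V))
m = (rng.random((T, V)) > 0.3).astype(float)
s_hat = rng.normal(size=(T, V))
W_in = rng.normal(scale=0.1, size=(2 * V, H))
W_h = rng.normal(scale=0.1, size=(H, H))

X_f_add = simple_rnn(joint_fuse(x_bar, m, s_hat, "add"), W_in, W_h)
X_f_mul = simple_rnn(joint_fuse(x_bar, m, s_hat, "mul"), W_in, W_h)
```

In both variants the static influence enters at every time step and for every time series variable before the recurrent layers see the data.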

Joint fusion presents the advantage of generating enhanced feature representations for each modality during each training iteration. Unlike early fusion, joint fusion adds static features at every time step for each time series variable, potentially enhancing the performance of subsequent tasks. However, in instances where modalities are correlated at higher abstraction levels, this approach may lead to a diminishing effect on temporal learning.

3.4.3 Post-Fusion

Post-fusion is a variation of joint fusion [50], where information from both modalities is combined at high levels of abstraction. In our implementation of post-fusion in FaIC-GAN, we first extract temporal and static feature embeddings independently and subsequently fuse temporal representations from the final time point with static embeddings.

We adopt two variations for the integration of the learned weights from both modalities, employing both additive and multiplicative methods to observe their effects on imputation and classification. The post-fusion strategy in FaIC-GAN can be expressed as:

$$\begin{aligned} h_x&= \text {RNN}(\bar{X}, M), \quad h_s = \text {FC}(S) \\ X_{f}&= h_x \circ h_s \end{aligned}$$

where \(h_x, h_s, X_{f} \in {\mathbb {R}}^{V}\) represent temporal embeddings, static embeddings and the multimodal joint representation, respectively. The notation \(\circ\) represents a point-wise addition or multiplication operation.
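
The post-fusion step can be illustrated with the sketch below. The array `rnn_states` stands for the per-time-step outputs of an RNN applied to \((\bar{X}, M)\) and is filled with random values here; `W_s` and `b_s` are hypothetical FC parameters, and the tanh activation is an assumption.

```python
import numpy as np

def post_fuse(rnn_states, s, W_s, b_s, mode="add"):
    """Fuse the final temporal representation with the static embedding."""
    h_x = rnn_states[-1]              # temporal embedding at the final time point, shape (V,)
    h_s = np.tanh(s @ W_s + b_s)      # static embedding h_s = FC(S), shape (V,)
    if mode == "add":
        return h_x + h_s
    if mode == "mul":
        return h_x * h_s
    raise ValueError("mode must be 'add' or 'mul'")

rng = np.random.default_rng(0)
T, V, U = 5, 3, 4
rnn_states = rng.normal(size=(T, V))  # hypothetical RNN outputs for (X_bar, M)
s = rng.normal(size=U)
W_s = rng.normal(scale=0.1, size=(U, V))
b_s = np.zeros(V)

X_f_add = post_fuse(rnn_states, s, W_s, b_s, mode="add")
X_f_mul = post_fuse(rnn_states, s, W_s, b_s, mode="mul")
```

Unlike the joint fusion sketch, the static embedding here never passes through the recurrent layers; the two modalities meet only after temporal learning is complete.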

Previous studies have substantiated the utility of adding static influences at higher levels of abstraction [51,52,53]. Unlike joint fusion, the post-fusion strategy ensures that the static information interacts with the time series modality at its highest level of abstraction, just prior to imputation. This deliberate positioning of interaction points could potentially exert a distinct influence on the imputation process and the associated classification task.

3.4.4 Attention-Based Fusion

Recent research highlights the effectiveness of attention-based models for time series modelling and imputation in GANs [54, 55]. In our implementation of attention-based fusion in FaIC-GAN, we employ a scaled dot product attention [56]. It computes a strength-of-relationship score between static features and time series observations at each time point, wherein higher scores indicate more relevant static features for temporal learning. The attention-based fusion strategy in FaIC-GAN can be expressed as:

$$\begin{aligned} Q&= \text {FC}(\bar{X}, M), \quad {\mathcal {K}} = \text {FC}({\hat{S}}), \quad {\mathcal {V}} = \text {FC}({\hat{S}}) \\ \hat{Q}&= \text {Softmax}\left( \frac{Q{\mathcal {K}}^T}{\sqrt{\text {dim}}}\right) {\mathcal {V}} \\ X_f&= \text {Concat}(Q, \hat{Q}) \end{aligned}$$

where \(\text {dim} = T \times V\) and \(Q,{\mathcal {K}},{\mathcal {V}},\hat{Q} \in {\mathbb {R}}^{T \times V}\). \(Q\), \({\mathcal {K}}\) and \({\mathcal {V}}\) represent query, key and value combinations, respectively, corresponding to input embeddings from the temporal and static modalities. \(\hat{Q}\) denotes the “attended” temporal embeddings that have been infused with relevant static influences for each time step. We employ FC to project \((\bar{X},M)\) and \({\hat{S}}\) into the same dimensional space \({\mathbb {R}}^{T \times V}\) to facilitate the dot product calculation (\(Q{\mathcal {K}}^T\)). To prevent very large relationship scores from returning very small gradients in the Softmax function, we scale the scores by dividing them by the square root of dim, which corresponds to the dimension of \({\mathcal {V}}\) (and \(Q\)). The resulting \(\hat{Q}\) contains information from the static modality that is correlated with the time series modality. Unlike the aforementioned strategies, an attention-based mechanism can accurately model correlated parts between modalities [57]. Ideally, this enables the model to learn only the valuable information from the static modality. A potential limitation of the attention strategy lies in its failure to account for the diminishing effect on temporal learning over time. To address this, we concatenate the original \(Q\) with the attended feature \(\hat{Q}\).
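
The attention-based fusion can be sketched in NumPy as below. Each FC projection is represented by a single hypothetical weight matrix (`W_q`, `W_k`, `W_v`) without bias, and \((\bar{X}, M)\) is assumed to be concatenated along the feature axis before projection; these choices are illustrative assumptions.

```python
import numpy as np

def softmax(a, axis=-1):
    """Numerically stable softmax."""
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(x_bar, m, s_hat, W_q, W_k, W_v):
    """Scaled dot-product attention with temporal queries and static keys/values."""
    T, V = x_bar.shape
    Q = np.concatenate([x_bar, m], axis=-1) @ W_q    # queries from (X_bar, M), shape (T, V)
    K = s_hat @ W_k                                  # keys from static embeddings, (T, V)
    Val = s_hat @ W_v                                # values from static embeddings, (T, V)
    scores = Q @ K.T / np.sqrt(T * V)                # scaled by sqrt(dim), dim = T x V
    weights = softmax(scores, axis=-1)               # each row sums to 1
    Q_hat = weights @ Val                            # attended temporal embeddings
    return np.concatenate([Q, Q_hat], axis=-1), weights

rng = np.random.default_rng(0)
T, V = 5, 3
x_bar = rng.normal(size=(T, V))
m = (rng.random((T, V)) > 0.3).astype(float)
s_hat = rng.normal(size=(T, V))
W_q = rng.normal(scale=0.1, size=(2 * V, V))
W_k = rng.normal(scale=0.1, size=(V, V))
W_v = rng.normal(scale=0.1, size=(V, V))

X_f, weights = attention_fuse(x_bar, m, s_hat, W_q, W_k, W_v)
```

The first V columns of \(X_f\) hold the original queries and the last V columns hold their attended counterparts, mirroring the concatenation in the final line of the equation.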

Unlike other methods that apply an attention mechanism to temporal data to enhance the interaction between two modalities [24], the attention-based fusion strategy in FaIC-GAN capitalises on static features that hold relevance for temporal data imputation at each time step. To our knowledge, this novel paradigm has not been previously applied within a GAN framework. Previous works using attention in GANs have aimed to infer relevant medical codes and/or patient visits for rare disease detection [58, 59] and to capture relevant historical Covid case sequences to predict the number of Covid cases [60].

4 Experimental Results and Discussion

The primary objective of the empirical analysis was to assess whether FaIC-GAN (a multimodal model) outperforms unimodal models, and to assess the effect of static features on temporal observations. We systematically examine the varying effects of four fusion strategies on the imputation and classification objectives of the FaIC-GAN model. We used several datasets with diverse characteristics to evaluate the effectiveness of FaIC-GAN. This section includes details of datasets, experimental settings, results for each fusion strategy and an analysis of their relative performance on the datasets. It also provides hyper-parameter tuning results that give insight into how the objectives in FaIC-GAN are affected by two key hyper-parameters, \(\alpha\) in Eq. 9 and \(\varrho\) in Eq. 5. We also conduct ablation studies on the classification performance of FaIC-GAN when fusion is not applied in D.

4.1 Datasets

For our experiments, we used data from the education and medical domains. The Early School Leavers (ESL) dataset [61] records information on a cohort of students across 11 school years. Each record is flagged according to whether or not the student dropped out before completing year 12. During preprocessing, we removed unary, derived, duplicate, leaky and zero-variance variables, standardised continuous variables and one-hot encoded categorical variables. Data from the first six school years were excluded because they contained only one or two temporal characteristics. Additionally, three open-access medical datasets were used: PhysioNet [62] from the 2019 PhysioNet challenge [63] for sepsis detection, MIMIC-3 [64] from the MultiBench repository [65] for mortality prediction (whether the patient dies in 1 day, 2 days, 3 days, 1 week, 1 year or longer than 1 year) and OASIS [66] for the progression of dementia in older adults (whether the state of dementia is positive, transitioning to positive or negative). Table 1 describes the specific characteristics of these four datasets.

Table 1 Longitudinal datasets used in experiments

4.2 Experimental Settings

We split each dataset into stratified training (80%) and testing (20%) sets. We apply fivefold cross-validation on the OASIS and PhysioNet datasets due to disparities in their size and sparsity, and threefold cross-validation on the MIMIC-3 and ESL datasets given their large sizes and relatively low levels of missingness. The abundance of samples in these latter two datasets mitigates the potential bias towards specific hyper-parameter values during training, thereby reducing the risk of overfitting. Because all four datasets are imbalanced (Table 1), the training sets were oversampled using SMOTE [67]. This was considered necessary as preliminary experiments revealed a significant drop in model performance without re-sampling.
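A stratified 80/20 split can be sketched as follows. This is an illustrative NumPy version, not the code used in our experiments; the function name and the per-class rounding rule are assumptions.

```python
import numpy as np

def stratified_split(y, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) so that each class keeps roughly
    the same proportion in both subsets."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        n_test = max(1, int(round(test_frac * len(idx))))  # at least one test sample per class
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.sort(np.array(train_idx)), np.sort(np.array(test_idx))

# Imbalanced toy labels: 90 negatives, 10 positives.
y = np.array([0] * 90 + [1] * 10)
train_idx, test_idx = stratified_split(y)
print(len(train_idx), len(test_idx), int(y[test_idx].sum()))
```

Any oversampling is then applied to the training indices only, so that the test set keeps the original class distribution.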

During training, we randomly mask 10% of the data. During testing, we systematically increase the amount of missingness in the temporal data ranging from 20 to 80%. This approach allows us to assess the performance of each fusion strategy across varying degrees of missingness. Tables 2, 3, 4 and 5 present the results at increasing levels of missingness, where 0.2 missing means 20% of the temporal data (or multivariate time series) in the test sets are missing (masked).
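The masking protocol can be sketched as below. The sketch assumes that masks are drawn independently per entry and applied on top of any inherent missingness, and that masked entries are zero-filled; these details and the function name are illustrative assumptions.

```python
import numpy as np

def mask_at_random(X, M, rate, seed=0):
    """Hide roughly a fraction `rate` of the entries.
    X: (N, T, V) values; M: same shape, 1 = observed, 0 = missing."""
    rng = np.random.default_rng(seed)
    drop = rng.random(X.shape) < rate       # True where an entry is masked
    M_new = M * (~drop)                     # inherent missingness is preserved
    X_new = np.where(M_new == 1, X, 0.0)    # zero-fill hidden entries
    return X_new, M_new

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10, 5))
M = np.ones_like(X, dtype=int)

# Test-time sweep over increasing missing rates.
for rate in (0.2, 0.4, 0.6, 0.8):
    _, M_test = mask_at_random(X, M, rate)
    print(rate, round(1.0 - M_test.mean(), 2))
```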

We use the Adam optimiser [68] and selected model settings heuristically, with the number of training epochs chosen from {10,20}. The learning rate was chosen empirically from the grid of {0.00001,0.0001,0.001,0.01}, and the \(\delta\) hyper-parameter in Eq. (12) was selected from the {0.1,0.3,0.5,0.7,0.9} grid. We tested batch sizes from the grids of {5,10,20} for the smaller OASIS dataset, and {64,128,256,512} for the three larger datasets. Batch size 10 was chosen for OASIS and 512 for PhysioNet, ESL and MIMIC-3. Preliminary tests showed that larger batch sizes gave better results in terms of training stability and generalisation.

Initially, we employed four unimodal models denoted with prefix “uni-” (see Tables 2, 3, 4 and 5) to determine the best neural network model for learning the temporal modality in our datasets. Among the unimodal models, two are temporally sensitive: (1) gated recurrent units (Uni-GRU) and (2) long short-term memory (Uni-LSTM). The other two non-temporally sensitive models are: (3) a fully connected network (Uni-FC), where \({\hat{X}}\) is flattened to dimensions \([T \times V]\) before being fed into the model, and (4) a 2-D convolutional neural network (Uni-CNN) with input dimensions of height V and length T, without pooling or up/down sampling. These non-temporally sensitive models are chosen because RNNs’ long-term dependencies may not always be relevant in certain contexts [69]. The LSTM model emerged as the most effective unimodal model, leading us to incorporate it into our FaIC-GAN models. Results for all unimodal models are included to justify our selection of LSTM in FaIC-GAN, and underscore the importance of leveraging multiple modalities in classification tasks.

Since we lack ground truth due to inherent missing data in the datasets, we do not report the imputation results [36]. For each experiment, we record accuracy, weighted F1 score and area under the curve (AUC) to assess the classification performance of FaIC-GAN under various settings. To improve model performance, we applied one-vs-one (ovo) AUC during training for the multiclass datasets MIMIC-3 and OASIS (as shown in Tables 4 and 5), where the AUC is averaged over all possible pairs of classes. Past studies [70] show that combining ovo with SMOTE yields the best classification results for imbalanced datasets; we have therefore applied both in our experiments.
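For illustration, a one-vs-one AUC can be computed as the unweighted mean of pairwise AUCs over all class pairs. The NumPy sketch below is an assumption about the averaging scheme (it follows the common macro-averaged definition) and requires every class to appear in the labels; the function names are illustrative.

```python
import numpy as np
from itertools import combinations

def binary_auc(scores, is_pos):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    pos, neg = scores[is_pos], scores[~is_pos]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def ovo_auc(y_true, proba):
    """y_true: (N,) integer labels in 0..C-1; proba: (N, C) class scores."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    pair_scores = []
    for a, b in combinations(range(proba.shape[1]), 2):
        keep = (y_true == a) | (y_true == b)        # samples of this class pair only
        y, p = y_true[keep], proba[keep]
        auc_a = binary_auc(p[:, a], y == a)         # class a treated as positive
        auc_b = binary_auc(p[:, b], y == b)         # class b treated as positive
        pair_scores.append(0.5 * (auc_a + auc_b))
    return float(np.mean(pair_scores))

y = np.array([0, 1, 2, 0, 1, 2])
perfect = np.eye(3)[y]                # one-hot scores matching the labels
print(ovo_auc(y, perfect))
```

In practice, scikit-learn's `roc_auc_score` with `multi_class="ovo"` provides an equivalent, well-tested routine.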

4.3 Model Performances

This section presents the classification performance for each dataset, drawing comparisons between different FaIC-GAN and unimodal models. Because this study is driven by the classification objective, the best models were selected based on the highest F1 scores. In Tables 2, 3, 4 and 5, the results highlighted in bold indicate the best performance, those underlined show the second-best performance and the italicised results show the third-best performance.

4.3.1 ESL Results

Results for FaIC-GAN and unimodal models on ESL are presented in Table 2. The Uni-LSTM model performs best in terms of AUC, showing the temporal modality to be discriminative on its own. However, as missingness increases, classification performance declines drastically for all unimodal models. Although the post-additive FaIC-GAN model shows lower AUC scores in comparison with the uni-LSTM (Table 2), tests of significance performed (Table 6) show post-add FaIC-GAN to have significantly stronger discriminative ability than uni-LSTM. The lower Acc and F1 scores of uni-LSTM would indicate that ESL’s shorter sequence length does not allow LSTM to extract enough temporal context for more precise predictions.

For ESL, static influences appear to have a substantial impact on classification. This can be seen in the boosted performance of FaIC-GAN models at higher rates of missingness in the temporal modality. The post-add FaIC-GAN model’s enhanced performance at higher levels of missingness indicates that intra-modal features contribute more effectively to classification at higher levels of abstraction.

Compared to the other datasets, the ESL dataset yields the most promising performance by our FaIC-GAN models. This could potentially be attributed to the large number of static features that contribute more to the classification task. The multimodal models outperform all unimodal models, although their distinct effects are less pronounced at lower levels of missingness. The robust performance of joint-multiplicative FaIC-GAN and attention FaIC-GAN (at higher missing rates) indicates that introducing added variability into the temporal space through static influences does aid classification performance on this dataset, particularly in the presence of missing data.

Table 2 ESL test dataset: classification performance of each fusion strategy and unimodal model under increasing levels of missing data

4.3.2 PhysioNet Results

Table 3 shows that several FaIC-GAN fusion strategies perform well on PhysioNet. The early FaIC-GAN model displays greater confidence in its classification outputs than other models at higher missing rates. This may indicate more defined interactions of the modalities at lower levels of abstraction. In terms of model confidence, the attention and post-add FaIC-GAN models do well at lower levels of missingness; their declining confidence as the introduced missingness increases has also been observed in other studies with this dataset [71].

The comparable performance of Uni-LSTM to the FaIC-GAN models with higher missing data rates indicates reduced variability in samples for discrimination, despite the inclusion of static modality. With 80% missing values, PhysioNet experiences disruptions in temporal correlations [2], leading to error propagation along the sequence that affects the classification task [36]. Consequently, cross-modal interactions that might exist at higher abstraction levels are curtailed, while those at lower abstraction levels become more pertinent to classification. Extracting complementary information from the modalities at lower abstraction levels enables the Early FaIC-GAN model to gain a more holistic understanding of the data distribution, ultimately leading to better classification performance.

The ablation study presented in Sect. 4.4.3 provides empirical evidence showcasing performance improvements ranging from 20-60% for PhysioNet when D incorporates multimodal features. Although the diverse effects of fusion strategies are less pronounced for this dataset and the FaIC-GAN does not show distinct differentiation from the unimodal uni-LSTM at high levels of data missingness, significance tests (see Table 6) show that the post-add FaIC-GAN significantly improves predictive and discriminative capabilities over the unimodal uni-LSTM model.

Table 3 PhysioNet test dataset: classification performance of each fusion strategy and unimodal model under increasing levels of missing data

4.3.3 MIMIC-3 Results

The performance of various FaIC-GAN models on the MIMIC-3 dataset is presented in Table 4. The temporal component of this dataset contains no inherently missing values. Here, the post-fusion FaIC-GAN models show a greater ability to differentiate between classes than other models. AUC scores also indicate that, as missingness increases, the uni-LSTM model, although showing better classification performance, becomes less confident in its predictions than the FaIC-GAN models.

For Acc and F1 scores, the uni-LSTM model’s performance appears to indicate a lack of reliance on the static modality for the classification of this dataset. Also, MIMIC-3’s longer sequence length allows LSTM to fully utilise its ability to learn longer temporal dependencies, thereby leading to improved predictions. One explanation for why fusing the static modality could cause lower AUC scores in the multimodal models is that the static modality is less discriminative, potentially introducing ambiguity into the temporal data space and resulting in weak structural organisation [72], which in turn could lead to class overlap and affect predictions of FaIC-GAN models. This sensitivity to the integration of the static modality is the reason for the differential effects of the fusion strategies being more pronounced for this dataset. However, the comparable performance of attention and joint-additive FaIC-GAN models indicates that incorporating static features for each time step can influence the learning of temporal information.

Marginal classification performance may arise from factors such as (1) cumulative temporal disruptions along the longer sequence length [2], (2) high dimensionality in the temporal component [73] and (3) extreme class imbalance in five out of six classes, where the model may struggle to adequately learn minority classes. Given that only the training set underwent oversampling, the interpolated samples from SMOTE may have either (1) lacked diversity, leading to overfitting to hyper-parameter settings [74], or (2) introduced variations into the training set not encountered in the testing set. Such conditions can detrimentally impact the classifier’s discriminative capacity and its generalisation ability.

Table 4 MIMIC-3 test dataset: classification performance of each fusion strategy and unimodal model under increasing levels of missing data

4.3.4 OASIS Results

Table 5 presents the results for the OASIS dataset. Here, the post- (additive and multiplicative) and attention strategies in FaIC-GAN show better classification performance, as well as higher class differentiating ability. The performance of these multimodal models indicates that for this dataset, there are significant interactions between the static and temporal modalities over time. Research also indicates strong links between dementia (outcome) and measured static features such as socio-economic status and education level [75]. We observe for most FaIC-GAN models that classification scores fluctuate as missing data rates increase. This seems to indicate that as noise increases within the temporal modality, the multimodal models become more dependent on the static modality (to varying extents) to determine the classification outcome.

While AUC scores show that the post- and attention FaIC-GAN models do better at distinguishing between multiclass outcomes, the generally low classification results reflect the unique challenges posed by the OASIS dataset. The dataset’s small size, short sequence length, limited static features, multiple classes and class distribution imbalance collectively hinder the classification objective. The low scores may be due to overfitting to hyper-parameters during training, leading to poor generalisation [76]. Interestingly, the uni-CNN model outperforms all other models at low data missingness levels. This indicates that low-level patterns or short-term relationships among temporal features offer more informative cues for the classification task.

Table 5 OASIS test dataset: classification performance of each fusion strategy and unimodal model under increasing levels of missing data

4.3.5 Analysis of Relative Performance

The datasets differed in size, sparsity, length of sequence, number of classes and class imbalance ratio. Thus, the effects of the different fusion strategies on FaIC-GAN when applied to the datasets revealed aspects of the modalities’ interactability at different levels of abstraction, particularly in the case of missing temporal data.

The results in Tables 2, 3, 4 and 5 show that the post-additive and attention-based fusion strategies outperform unimodal and other multimodal FaIC-GAN baselines. However, to further substantiate our findings, we run a statistical significance test on our dominant multimodal model (post-additive FaIC-GAN) and the best performing unimodal model (uni-LSTM) to observe if there are significant improvements over the unimodal baseline.

We conducted Student’s t-tests on independent samples from 50-plus runs of the uni-LSTM model and the post-additive FaIC-GAN model at a 60% missing ratio. For the large ESL and MIMIC-3 datasets, we performed \(18\times 3\)-fold cross-validation (CV) runs (54 runs), and for PhysioNet and OASIS \(10\times 5\)-fold CV runs (50 runs). Table 6 shows the mean accuracy, F1 and AUC scores. The model (post-additive FaIC-GAN or uni-LSTM) whose mean score is higher with a P-value \(\le 0.05\) offers significantly better performance than the other.
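As a sketch of the test statistic, the pooled-variance (Student's) t for two independent samples can be computed with the standard library alone. The equal-variance form is an assumption here, and the score lists below are made-up toy values rather than results from our runs.

```python
from math import sqrt
from statistics import mean, variance

def student_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Toy per-run scores for two hypothetical models (illustrative values only).
model_a = [1.0, 2.0, 3.0, 4.0, 5.0]
model_b = [2.0, 3.0, 4.0, 5.0, 6.0]
t, df = student_t(model_a, model_b)
print(t, df)
```

With 54 runs per model the test has 106 degrees of freedom, and with 50 runs per model it has 98. The P-value is then read from the t-distribution with those degrees of freedom; `scipy.stats.ttest_ind` returns the statistic and the P-value together.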

Table 6 Test of significance at 60% missing rate

P-values on all mean AUC scores in Table 6 indicate that our post-additive model has significantly higher discriminative abilities than the uni-LSTM model. It also shows significant predictive dominance over the uni-LSTM for all datasets except MIMIC-3. This may be due to static features adding ambiguity into the temporal learning space, as discussed in Sect. 4.3.3. We discuss further the relative effects of the fusion strategies on the different datasets in Sect. 4.5.

4.4 Sensitivity Analysis and Ablation Study

The loss functions of the FaIC-GAN models depend on three parameters: p_hint or \((1-\varrho )\) (Eq. 5), alpha \(\alpha\) (Eq. 9) and delta \(\delta\) (Eq. 12). It is worth mentioning that we empirically tested the sensitivity of \(\delta\), and found that both classification and imputation tasks remain unaffected by this parameter, compared to \(\varrho\) and \(\alpha\). Therefore, we present the results of the sensitivity analysis conducted on \(\varrho\) and \(\alpha\). We observe the effect of \(\varrho\) and \(\alpha\) values on all fusion-based FaIC-GAN and unimodal Uni-LSTM models, reporting the results on the ESL dataset.

4.4.1 Effect of the p_hint Hyper-Parameter

The p_hint value determines the extent to which information is disclosed to D regarding which features in the observed imputed sample (\({\hat{X}}\) in Eq. 3) are real and which are estimated. p_hint is selected from the grid of {0.1,0.3,0.5,0.7,0.9} where a value of 0.9 implies that only 10% of the information in M is disclosed to D. The effect of increasing the levels of mask information passed to D on the performance of FaIC-GAN and Uni-LSTM models is shown in Fig. 3 and Fig. 4, respectively. Except for the post-additive FaIC-GAN model that shows lower RMSE scores, p_hint appears to have minimal impact on imputation performance for different fusion strategies.
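To make the role of p_hint concrete, the sketch below assumes a GAIN-style hint, in which each mask entry is revealed to D with probability \(1 - p\_hint\) and replaced by the uninformative value 0.5 otherwise. This form is an assumption for illustration and may differ from the exact definition in Eq. 5; the function name is hypothetical.

```python
import numpy as np

def make_hint(M, p_hint, seed=0):
    """M: binary mask (1 = observed, 0 = missing).
    Each entry of M is disclosed with probability 1 - p_hint (assumed form)."""
    rng = np.random.default_rng(seed)
    reveal = rng.random(M.shape) < (1.0 - p_hint)
    # Revealed entries carry the true mask value; hidden ones carry 0.5.
    return np.where(reveal, M.astype(float), 0.5)

rng = np.random.default_rng(3)
M = (rng.random((100, 100)) < 0.5).astype(int)
for p_hint in (0.1, 0.5, 0.9):
    H = make_hint(M, p_hint)
    print(p_hint, round(float((H != 0.5).mean()), 2))   # fraction of M disclosed
```

Under this assumption, a p_hint of 0.9 discloses roughly 10% of the mask entries, matching the interpretation given above.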

Regarding classification, Fig. 3.f shows that accuracy decreases for the post-additive FaIC-GAN model as p_hint decreases (i.e. as more information about the temporal mask is passed to D). A similar behaviour can be observed in the attention FaIC-GAN (Fig. 3.b) and more distinctly in the post-multiplicative FaIC-GAN (Fig. 3.d). This suggests that, for this dataset, the classification objective in Eq. 6 benefits more when D is less informed about the temporal modality’s missingness, in contrast to the contribution from the static modality. Notably, Uni-LSTM shows that increasing the amount of hints provided to D about the missingness context assists in learning the conditional distribution of the temporal modality, although its accuracy scores remain suboptimal (Fig. 4).

Fig. 3 Effect of \(p\_hint\) on the performance of four fusion-based FaIC-GAN models

Fig. 4 Effect of \(p\_hint\) on the performance of unimodal (Uni-LSTM) model

4.4.2 Effect of the alpha Hyper-Parameter

The alpha \(\alpha\) value regulates the contribution of the reconstruction task on G’s update (Eq. 9). Alpha is chosen from the grid {0.01,0.1,1.0,10,100}; a value of \(\alpha =0.1\) implies that the reconstruction (RMSE) loss contributes only 10% to the improvement of G in generating viable samples for imputation. The results in Fig. 5 show the effect of increasing alpha values on accuracy, AUC and RMSE in multimodal FaIC-GAN models. We observe a lower sensitivity of the imputation objective to changes in alpha, indicating that the improvement of G in its reconstruction task may have a comparatively smaller effect than on its adversarial task. Similar behaviour is observed for all fusion strategies, except for the post-additive FaIC-GAN model (Fig. 5.e) wherein we observe a better imputation performance and the highest gain in classification performance at \(\alpha =10\). The attention FaIC-GAN shows an increase in classification performance when G focuses more on its reconstructive ability.
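The weighting role of alpha can be illustrated with a small sketch. It assumes that G's objective is the adversarial loss plus alpha times a reconstruction RMSE computed over observed entries only; this additive form, the masking of the RMSE and the function names are assumptions and may not match Eq. 9 exactly.

```python
import numpy as np

def masked_rmse(X, X_gen, M):
    """RMSE over observed entries only (M == 1)."""
    diff = (X - X_gen) * M
    return float(np.sqrt((diff ** 2).sum() / max(M.sum(), 1)))

def generator_loss(adv_loss, rec_loss, alpha):
    """Assumed additive form: adversarial term plus alpha-weighted reconstruction."""
    return adv_loss + alpha * rec_loss

X = np.array([[1.0, 2.0], [3.0, 4.0]])
X_gen = np.array([[1.0, 0.0], [0.0, 4.0]])
M = np.array([[1, 1], [0, 1]])           # entry (1, 0) is missing and ignored
rec = masked_rmse(X, X_gen, M)

for alpha in (0.01, 0.1, 1.0, 10, 100):
    print(alpha, round(generator_loss(1.0, rec, alpha), 3))
```

Under this form, alpha sets the weight of the reconstruction term relative to the adversarial term, so larger values push G towards reproducing the observed entries.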

Figure 6 reveals that increasing the contribution of the reconstruction task by a factor of 10 (i.e. \(\alpha =10\)) in Uni-LSTM assists classification (as seen by the improved accuracy and AUC scores), but does nothing for imputation (RMSE). A higher AUC value indicates greater confidence in distinguishing between classes using the temporal modality.

The multiplicative strategies (joint and post) show high accuracy despite G placing emphasis on its reconstructive effort. This indicates that introducing additional variance into the temporal sampling space through static influences may be benefiting the adversarial objective, which in turn informs the classification objective.

Fig. 5 Effect of alpha on the performance of four fusion-based FaIC-GAN models

Fig. 6 Effect of alpha on the performance of unimodal (Uni-LSTM) model

4.4.3 Discriminator Without Multimodal Fusion Versus Unimodal

The proposed FaIC-GAN model (as shown in Fig. 1) includes both D and G within a multimodal fusion framework. In doing so, the FaIC-GAN model regularises the joint learning tasks and mitigates the challenges associated with dual objectives in a multitask shared model approach [77]. To assess the significance of implementing D in a multimodal fusion framework, we conducted an ablation study in which D is implemented as unimodal (Uni-LSTM). The rationale is that integrating static data with temporal observations in G could improve the accuracy of missing value estimates and consequently improve time series classification performance (implicitly).

We evaluated the performance of both unimodal and multimodal D using the PhysioNet dataset. The results in Table 7 show that using a unimodal D with FaIC-GAN leads to suboptimal accuracy. On the other hand, FaIC-GAN with a multimodal D produces classification outcomes with improvements of up to 50% in most F1 scores and up to 40% in AUC. Notably, RMSE values rose by 5–10%. It appears that prior to enhancing D for improved classification, the imputation task tied to the adversarial objective dominated FaIC-GAN’s learning process. However, when D was enriched with multimodal learning, imputation performance slightly declined. Despite this, classification performance improved across all FaIC-GAN models.

By virtue of the end-to-end design and a shared multitask mechanism in D, FaIC-GAN has the advantage of concurrently optimising imputation and classification tasks [36], allowing each task to benefit from the other’s learning [2]. Despite the boosted classification performance, our results highlight the importance of regularising joint learning tasks and addressing the potential trade-offs associated with dual objectives in a multitask shared model approach [77].

Table 7 An ablation study with the unimodal discriminator D in FaIC-GAN model

4.5 Discussion: Effects of Fusion Strategies in FaIC-GAN

In summary, our results show the multimodal FaIC-GAN models to significantly outperform the unimodal models. Dataset properties like sequence length, class imbalance ratio and existing levels of missing data were shown to influence model outputs. Given the varying characteristics of the datasets, no individual fusion strategy in FaIC-GAN or any of the unimodal models emerged as the single best performer across all datasets, at all levels of data missingness. However, the post-additive and attention fusion strategies performed consistently well relative to the other baselines across the datasets, as highlighted and discussed throughout Sect. 4.3.

As the most straightforward approach, early fusion is easy to apply and thus broadly used in existing works [22, 23]. However, it may fail to preserve the intra-modal features within a modality for classification. On the other hand, the joint and attention fusion strategies add the static features at each time step, which is less likely to degrade the temporal characteristics of the temporal modality. The primary difference between the joint and post fusion strategies lies in their application at different levels of abstraction. Specifically, joint fusion combines the two modalities prior to the feature extraction layers. As a result, the static modality may assist temporal feature learning. Post fusion combines the two modalities after extracting the temporal features. Consequently, the static component is expected to directly assist in the generation of missing values. Unlike other fusion strategies, attention fusion considers the correlation between modalities. It aims to eliminate irrelevant features from the static component, allowing the static modality to enhance the learning of temporal features. This effect is shown in our experimental results.
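The difference between joint and post fusion can be sketched in a few lines of NumPy. The sketch assumes that the static features have already been projected to a matching dimension and are broadcast across time steps, and it uses a single tanh layer as a stand-in for the temporal feature extractor rather than an LSTM. All function names are illustrative.

```python
import numpy as np

def extract(h, W):
    """Stand-in temporal feature extractor: one tanh layer applied per time step."""
    return np.tanh(h @ W)

def joint_fusion(X, s, W, mode="add"):
    """Combine static s (V,) with every time step of X (T, V) BEFORE feature extraction."""
    fused = X + s if mode == "add" else X * s
    return extract(fused, W)

def post_fusion(X, s_feat, W, mode="add"):
    """Extract temporal features first, THEN combine with static features s_feat (H,)."""
    feats = extract(X, W)
    return feats + s_feat if mode == "add" else feats * s_feat

rng = np.random.default_rng(0)
T, V, H = 6, 3, 4
X = rng.normal(size=(T, V))       # temporal input
W = rng.normal(size=(V, H))       # extractor weights
s = rng.normal(size=V)            # static features in input space
s_feat = rng.normal(size=H)       # static features in feature space

print(joint_fusion(X, s, W).shape, post_fusion(X, s_feat, W, mode="mul").shape)
```

In joint fusion the static signal passes through the extractor's non-linearity together with the temporal input, whereas in post fusion it acts directly on the extracted temporal features.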

The effectiveness of the post-additive FaIC-GAN model shows that interactions between the modalities at higher levels of abstraction allow for the classification task to benefit equally from both modalities. In cases where deficiencies exist for one modality, for example, if the temporal modality is too noisy, or the static modality ambiguous, the model is able to leverage the other modality’s learned representations to compensate for the deficiency. The attention-based FaIC-GAN model shows static features to have varying effects on temporal observations over time; by capturing only relevant cross-modal interactions at each time point, it better informs both temporal learning and the classification objective.

Suboptimal performance was observed for fusion strategies at times when (1) integrating static influences at lower levels of abstraction weakened their impact over time due to the vanishing gradient problem inherent to RNNs and (2) giving equal importance to static and temporal feature interactions either from the outset or for all time steps did not allow the model to adequately capture complementary and correlated information present in modalities. However, strong correlations were observed to exist at lower levels of abstraction when the temporal modality was particularly noisy (see PhysioNet results in Sect. 4.3.2). Disruptions in temporal correlations [2] would have reduced the effect of cross-modal interactions at higher levels of abstraction, causing those at lower abstraction levels to be more pertinent to the classification task. Furthermore, complementary information extracted from the modalities at lower levels of abstraction allows the model to gain a more comprehensive understanding of the data distribution for better classification.

Differential effects of the fusion strategies on FaIC-GAN were found to be less pronounced in the PhysioNet and ESL datasets than in the MIMIC-3 and OASIS datasets. ESL results indicate that a highly informative static component, when combined with a noisy temporal component, causes the static modality to dominate regardless of the fusion strategy. PhysioNet, with its 80% missing data, indicates that highly sparse datasets can render efforts to draw correlated information from the modalities ineffective unless consideration is given to improving cross-modal interactions under such conditions. The MIMIC-3 dataset, with its fully observed time series, shows that a highly informative temporal modality is more sensitive to fusion strategies, particularly those that emphasise shared information from the modalities and thereby inadvertently suppress the temporal modality’s unique discriminative information. Results from OASIS, with its small data size, class imbalance and other issues, suggest that multiple factors affect the performance of fusion strategies in FaIC-GAN. While the model may be overfitting on the OASIS dataset, we observe that (1) selecting network settings that best draw out individual and correlated information from the longitudinal modalities, (2) applying effective regularisation measures that ensure stable outputs and (3) increasing sample size or incorporating other modalities would prove beneficial to classifying such datasets.

5 Conclusion

In this paper, we propose a conditional imputer-classifier GAN, FaIC-GAN, for longitudinal data that attempts to leverage the static modality in enhancing temporal learning for classification. We perform this in the presence of inherent missing data. By employing a joint representation learning approach in FaIC-GAN, a multimodal fusion strategy effectively combines static influences with the temporal features, particularly when faced with missing data, and informs temporal learning to boost classification.

The performance of FaIC-GAN models on different datasets revealed the interactability of longitudinal modalities at various levels of abstraction under missing data conditions. We note that classification performance was generally better for datasets with larger sizes, a more informative static modality and binary classes (e.g. ESL and PhysioNet). The larger sample size gave our models greater generalisation power and more informative static features (particularly for ESL and OASIS) meant the model could leverage the static modality’s enriched representations to inform classification better when the temporal modality is noisy. We attribute the comparatively poor performance of FaIC-GAN on the multiclass datasets (OASIS and MIMIC-3) to the extreme class imbalance that exists in these datasets. This, combined with the oversampling technique, would have led to overfitting and poor generalisation.

The differential effects of the fusion strategies on the static and temporal modalities show that dataset characteristics and the outcome of interest determine the effectiveness of the multimodal approach. We recommend first establishing the informativeness of individual modalities to the classification task, and offer the following guidance on where to fuse the modalities for longitudinal classification with missing data: (1) at lower levels of abstraction if the main temporal modality appears too noisy, (2) at intermediate levels or at each time step when only relevant cross-modal features are to be extracted to optimise the classification objective and (3) at higher levels of abstraction if the temporal data is highly discriminative or if unsure.

A limitation of our approach is that integrating the static modality can adversely affect the model’s predictive performance. This is observed for the MIMIC-3 dataset, where there is minimal or no missing data. We hope to explore effective fusion strategies that will not only preserve but enhance temporal learning in such cases. Also, the performance on the small OASIS dataset would indicate that the modalities are overfitting and generalising at different rates [78]. We hope to investigate such behaviour in small and large datasets as well. We also aim to improve our fusion strategies to further enhance imputation in GANs to boost imbalanced classification, particularly for multiclass classification.