1 Introduction

Patient–ventilator asynchrony (PVA) refers to the desynchronisation between ventilator cycles and patients’ breathing efforts during mechanical ventilation. For non-invasive ventilation, it is known that excessive PVA can lead to undesirable consequences such as discomfort, reduced sleep quality, and potentially ineffective ventilation [1]. Detecting PVA typically requires bedside monitoring by specialists, a labour-intensive process that is not always available when needed, so there have recently been many efforts to automate this process using machine learning (ML) models [2,3,4,5]. Generally, we may describe the problem as follows:

Given a collection of continuous ventilator waveform recordings, each segmented into breathing cycles and given event class annotations per breath by human experts (assuming no annotation = “normal" label), build an ML model which predicts the correct class label given a breathing cycle and its context.

Fig. 1

A sample segment of relevant channels in a ventilator monitoring signal dataset. Notice that while the pressure (Pmask) and, to a lesser extent, Flow channels tend to have cleaner signals, the effort belt readings (Thor and Abdo) are often much noisier and cannot be satisfactorily dealt with using simple signal filtering

However, most existing methods focus on invasive ventilation, which often happens in a well-controlled environment, allowing higher quality data to be collected. For non-invasive ventilation, the quality of collected data is usually much lower for a variety of reasons, such as the patient not being sedated, higher chances of air leaks, patient sleep state changes, possible interruptions to ventilation, etc. There is also a high degree of variability between waveforms observed from different patients, different observation periods and different ventilator settings. The waveform of the same type of PVA event may appear very different visually in different contexts. Moreover, machine learning models rely on manual annotation by human experts to provide supervision signals, yet given the large data quantity and high data variability, a human annotator is unlikely to always label event instances consistently (e.g. a borderline event instance may be more likely to be labelled as anomalous when surrounded by normal cycles than when between more prominent PVA events), resulting in noisy labels with class overlaps. Such label noise is also often not symmetric: it is more likely for an anomalous event to be missed than for a completely normal cycle to be tagged as anomalous.

Fig. 2

Example of autocycle/multi-trigger, double trigger and ineffective effort events, respectively. Note that many of these events must be identified from correlated features across channels, that there exist both natural and abnormal ventilation cycle length variations, and that the general waveform pattern in a larger context window may affect what an anomalous event looks like

Overall, apart from classification accuracy, there are three main challenges in applying machine learning to non-invasive PVA detection:

  • High variability within and across event classes - we need data representations that can capture rich signal morphology, such as:

    • multiple interacting channels: PVA data have interdependent data channels such as mask pressure and airflow. We must not only model relationships within data channels but also across them.

    • variable length input intervals: the basic unit of analysis, i.e. individual ventilation cycles, have different durations. We need a method to compare them without introducing unwanted length-based biases.

    • both time-dependent and time-independent features: ventilation cycles have variable lengths, and while some of these variations are normal, some cycle length variations are related to PVA events such as double triggering. We need to be able to “look past" trivial variations while not missing important time-dependent changes.

    • multiscale context windows: clinicians often score PVA events not in isolation but in context, and so must automatic algorithms. Close-range context, such as the current, previous and next cycles, is important for recognising certain event types like multiple triggering, whereas a wider context window is often needed to gauge how likely the patient is to experience PVA events at the current time. We need to integrate such information from multiple context window scales.

  • Waveform data noise - a data modelling method that is robust to input noise and inherent data uncertainty is necessary

  • Asymmetric label noise - a classification strategy that is robust to errors and ambiguities in class labels is necessary

To find a solution to these challenges, we break the overall problem into two sub-problems: finding an effective feature representation of ventilator waveform data, and designing an ML workflow that can deal with multiple sources of uncertainty.

On feature representation, we assess various types of traditional respiration-domain features as well as generic time series features against four main requirements of the PVA task: the ability to describe multiple interacting channels; compatibility with variable-length input; the inclusion of both time-dependent and time-independent features; and the ease of combining features across multiple context windows. We also consider other properties such as robustness to noise and computational efficiency. In the end, we select path signature features as our preferred waveform data representation method.

As for the classification workflow, with this much inherent uncertainty and variability in our data, it is not sufficient for an ML model to produce only a class prediction or class probabilities. To make informed decisions based on ML model inferences, the clinician would also need some measure of confidence for each ML prediction, as a human expert would have. The model needs to produce well-calibrated class probabilities, provide an indication of “insufficient evidence" cases from its predictions, and be able to incorporate knowledge about noisy labels.

In this work, we propose a (conditional) latent Gaussian mixture generative classifier with noisy label correction (GMGC-NL) to tackle these challenges. The latent Gaussian mixture model uses an invertible neural network (INN) [6] to transform signal features into a Gaussian mixture, allowing us to use a generative approach to model the input data distribution and obtain data generative probabilities (how well an observed instance is explained by the learned model), which in turn enables us to identify cases that are dissimilar to any of the typical event class patterns, i.e. out of the training data distribution. The generative classifier [7] part balances the two objectives of producing accurate predictions and providing input data probability estimates. Finally, we incorporate our knowledge about class-conditional label noise [8] into the model itself, which greatly improves our uncertainty estimates.

In Fig. 3, we provide an overview of the challenges from different aspects of the PVA detection problem and our proposed solutions.

Fig. 3

Overview of problem breakdown, challenges and proposed solutions of this research

Our contributions are as follows:

  • We propose a novel conditional latent Gaussian mixture generative classifier model with noisy label correction, which offers superior calibration, produces predictions that are more consistent with human annotators, and directly incorporates knowledge about label noise with only a very minor impact on a few classification performance metrics.

  • We show that a general sequence representation method, the path signature, provides an effective feature set for multichannel variable-length continuous sequence data such as ventilator recordings, and is naturally suitable for use in deep generative models.

  • Furthermore, we show that our model also provides a sensible and interpretable Out-of-Distribution (OoD) score per input instance based on generative modelling to measure the compliance of an unseen instance to the trained model, and in turn identify cases for which we lack sufficient evidence for decision-making, avoiding the “confidently incorrect" problem discriminative classifiers often have with OoD examples.

2 Related work

2.1 PVA detection features

As mentioned in the last section, an ideal PVA detection algorithm should utilise features that are able to capture ventilator monitoring signal’s rich morphology, in addition to common practical concerns such as efficiency and robustness to noise. In this section, we argue that common feature representation methods used in PVA detection do not satisfy all these requirements.

Most current automatic PVA detection methods rely on hand-crafted rules [9, 10] or use generic machine learning algorithms with hand-picked, clinically relevant features [2, 11, 12]. The rule-based models are based on domain knowledge and tend to capture the most important qualities of PVA events well, but the specific implementations of the rules can be highly sensitive to signal noise, e.g. the method in [9] relies on analysing the relationship between maximum pressure and the rate of change of flow, both of which may be hard to estimate and prone to fluctuation in noisy signals. Some rule-based methods rely on the existence of high-quality auxiliary signals such as EAdi [10], which are not available in non-invasive ventilation.

Table 1 Comparison of common feature types used in PVA classification

On the other hand, feature-based ML classifiers are able to look for more characteristic traits of anomalous events, but are often less interpretable than simple rules. In [2] the authors derived a set of physiologically relevant features from raw breath data and classified breaths based on comparisons with “gold standard" batches. However, to ensure the clinical relevance of predictions and eliminate the impact of data artefacts, this algorithm requires an elaborate fine-tuning process, which may not be easily replicable on a new dataset with different waveform characteristics. An algorithm based on a small set of features is also prone to noise if the feature extractor is not sufficiently robust to it, e.g. numerical differentiation and integration may not be stable when there are significant small fluctuations or sudden spiky artefacts in the data.

Overall, while these rule-based or hand-crafted methods are designed to capture rich signal morphology as outlined in the introduction section, their weakness lies in their lack of robustness when it comes to data quality.

While it is possible to use generic time series feature extractors such as the Fourier transform, wavelet transform, tsfresh [13], Catch-22 [14] or random convolutions [15] for PVA detection, their main limitation compared to hand-crafted features is that only a subset of the available features provide capabilities such as describing channel interactions or ensuring length independence. This leads to a need for a large library of diverse features to sufficiently capture signal morphology, increasing computational cost and lowering interpretability. Ideally, we want an efficient set of features that is quick to compute and exhibits high representational power.

Some recent methods apply deep learning techniques to PVA detection, directly classifying the raw waveform data with either recurrent [3] or convolutional [16] architectures. These models are capable of adapting to both rich signal morphology and data noise, although most deep learning architectures cannot directly accept variable-length sequences as basic units of input and require some form of padding or length normalisation. Common convolutional or recurrent architectures also extract inherently time-dependent features (via fixed-length convolution or recurrent windows), making them less efficient at capturing similarities between ventilation cycles that differ only in duration. Deep learning approaches also suffer from low interpretability and relatively high computational cost, although these disadvantages may be offset by higher classification performance and additional efforts to justify model predictions, such as through uncertainty quantification.

In Table 1, we offer a summary of existing feature choices for PVA event classification, outlining their main strengths and weaknesses. This lack of a single feature set satisfying most problem requirements and practical concerns prompts us to explore features beyond the usual suspects, such as the path signature.

2.2 Uncertainty quantification

In the previous subsection, we mentioned the strong performance of deep learning-based models. However, most of these methods are designed for invasive ventilation, where the data collected are less noisy. For non-invasive ventilation, the data signal-to-noise ratio tends to be much lower, and these models do not have built-in capabilities to deal with such issues.

Lack of interpretability and uncertainty quantification are often barriers to applying deep learning models in clinical practice, which is the case for typical deep discriminative classifiers. A large part of the issue is that the class probabilities \({\mathcal {P}}(y|x)\) which discriminative models produce are insufficient to tell us whether we have enough confidence or not for a prediction. It would be ideal to also obtain the data probability under the generative model \({\mathcal {P}}(x)\) or the class conditional \({\mathcal {P}}(x|y)\), which would help us identify input instances that are out-of-distribution / too dissimilar from typical event class patterns.

There exist two main pathways to uncertainty quantification—putting the uncertainty in model parameters, or directly building a (usually deterministic) model of the uncertainty itself. Estimating model parameter uncertainty (i.e. building a Bayesian neural network) relies on applying the Bayes rule \({\mathcal {P}}(\Theta |X) \propto {\mathcal {P}}(X|\Theta ){\mathcal {P}}(\Theta )\), which is then approximately evaluated with Monte Carlo methods. For deep learning models, this includes methods like Deep Ensemble [17] and MC Dropout [18]. The main drawback of this approach is the computational cost from having to train multiple models or evaluate a model multiple times.

Recently there have been many new developments in using deterministic models to estimate epistemic uncertainty [19]. Among these, flow-based models [20, 21] utilise an invertible network to directly learn the input data distribution by finding a bijection between the data and a tractable distribution, such as a Gaussian. Invertible generative classifiers [7] build upon normalising flows to find a bijection between input data and a Gaussian mixture with N (N = number of classes) components, which enables uncertainty quantification and OoD detection for classification tasks.

“Confidently incorrect" is a common problem in deep classifier models that both the Bayesian neural network and deterministic uncertainty model approaches are able to address. [22] demonstrated that ReLU-based networks commonly used in deep learning “necessarily" produce overconfident predictions on OoD examples, but even “bad" Bayesian approximations help address this problem. This prompts the use of inexpensive approximate Bayesian uncertainty quantification approaches such as the Laplace approximation [23]. Deterministic uncertainty models, on the other hand, try to directly capture the “outlying-ness" of input data by transporting the outlier detection problem into a simpler latent space. Both approaches are able to alert the user when the input data is outside the domain for which the training data can provide useful evidence for classification.

Conformal prediction [24] is a distribution-free method that produces prediction sets from arbitrary class score estimators such that these prediction sets are guaranteed to satisfy a given probability coverage criterion. It is model-independent and widely applicable, and can serve as a general-purpose converter from non-calibrated class scores to calibrated predictions. However, while it equips discriminative classifiers with uncertainty quantification, it does not enable us to directly estimate \({\mathcal {P}}(X)\) or \({\mathcal {P}}(X|Y)\), which we need for detecting OoD instances. Still, since conformal prediction is a model-agnostic approach to predictive uncertainty quantification, it is possible to use it as a post hoc step to complement model-derived uncertainty in our approach. In our work, we include a conformal prediction evaluation step to assess the quality of model-provided inter-class uncertainty.

2.3 Label noise

Another important aspect of predicting with uncertainty is label class uncertainty, which is especially true for PVA detection in non-invasive ventilation, where we cannot easily establish ground-truth labels from high-quality auxiliary signals such as esophageal pressure [16].

Label noise refers to the inconsistencies between the factual labels y and the observed labels \({\hat{y}}\). In the most general case, the noisy label follows a data-dependent distribution \({\hat{y}} \sim {\mathcal {P}}({\hat{Y}}|Y, X)\); however, in practice most methods for label noise mitigation make a class-conditional noise assumption [25]: \({\hat{y}} \sim {\mathcal {P}}({\hat{Y}}|Y)\), that is, the observed label \({\hat{y}}\) only depends on the underlying true label y. While this is certainly not true in reality, it is found to be reasonably effective in practice for many datasets [26]. For dealing with class-conditional label noise, it is important to know the class-conditional transition matrix \({\mathcal {P}}({\hat{Y}}|Y)\), which needs to be assumed known a priori or estimated by other means. In [26] the authors provided a technique to infer \({\mathcal {P}}({\hat{Y}}|Y)\) from the class probabilities of an existing discriminative classifier, assuming the class probabilities are reasonably informative. In [27] the authors proposed peer loss, a modification to the loss function that removes the need to infer \({\mathcal {P}}({\hat{Y}}|Y)\); however, it still needs additional weight calibration for imbalanced classification problems. In [28] the authors proposed a method which is able to deal with both class-conditional and data-dependent label noise, provided that multiple label sources are available. Instead of modifying the discriminative loss function, it is also possible to directly build the noise generation process into the model itself. [29] proposed a noise-aware classifier using a graphical model approach to improve image classification with both random and class-specific label noise.

Still, it is challenging to apply common label noise mitigation techniques to non-invasive PVA classification, mainly for two reasons: highly imbalanced classes and a potentially high class confusion rate. Class imbalance makes loss tweaking methods less effective [30], whereas a high confusion rate means that for some classes the supervision signal is very weak.

Nevertheless, as we shall demonstrate, if we can obtain an external estimate of label noise rate, then the class-conditional noise model is powerful enough to provide a significant uncertainty correction to a classifier.

3 Dataset and data preprocessing

3.1 Data description

The data for the study were collected from the polysomnographic (PSG) titration study of a cohort of 59 subjects during non-invasive ventilation (NIV). The original aim of the study was to determine the impact of PSG titration on treatment outcome factors such as sleep quality and PVA occurrence. In addition to standard ventilator outputs such as air pressure and flow, the dataset also includes readings from an effort belt device attached to the subjects’ thoracic and abdominal area, which can be used to approximate patient breathing effort [31]. The dataset also contains other relevant observations and study parameters, such as sleep stage scoring based on EEG. The dataset is manually annotated with PVA and sleep disruption event labels by a respiratory expert based on the PSG recordings. For PVA events, three types of events (autocycle, double trigger and ineffective effort) are identified based on the ventilator outputs (pressure, flow) and (thoracic and abdominal) effort belt readings. Based on the original labelling criteria [32], the Autocycle and Double Trigger labels can be merged into “double or multiple trigger event" due to similar labelling rules.

3.2 Data processing pipeline

Fig. 4

Preprocessing workflow

In total, we have ventilation monitoring data from 59 subjects in two overlapping groups, with a total of 104 studies. Each study instance contains around 20–24 data channels, of which four channels are most relevant for PVA event detection: Mask Pressure (Pmask), Flow—Tx, Thoracic effort belt reading (Thor) and Abdominal effort belt reading (Abdo). We are also interested in the subjects’ sleep stages, as the main objective of the research is to help PVA event detection during sleep. These represent the direct readings from the ventilator, as well as data collected from sensor belts attached to the subject for breathing effort tracking. In preliminary data preprocessing, we integrate all data from different source files and down-sample all data channels to 32 Hz. We remove two study instances in this stage due to source file format incompatibility.

As most study instances contain sections where the ventilator was not connected or monitored, and there were other occasional interruptions to the study, we use a repeating pattern detection algorithm [33] on the Pmask channel to find sections of the data where the ventilator pressure waveform exhibits a regular repeating pattern, indicating the ventilator is working during these sections. We extract all such contiguous sections as valid data intervals. In this stage we further exclude two study instances due to having entirely invalid or missing data channels. Additionally, as the source data was collected for studying ventilation during sleep, the sleep segments are what the ventilator is supposed to be optimised for and are more interesting to clinicians, whereas the non-sleep segments of the data contain a significant amount of ventilation interruptions or disruptions (such as the patient changing position or interacting with clinical staff). For these reasons, we opt to retain only segments of the data where the subject is in any stage of sleep according to the sleep stage annotation.

Algorithm 1

Breath Segmentation

Fig. 5

Phase estimation method

For each valid data interval, we apply a breath segmentation procedure based on state space reconstruction and rolling centroids. We find the 2D PCA representation of all length-10 subsequences of the Pmask data sequence, which forms an approximate state space embedding of the ventilator–lung system. As the system is approximately periodic, the state space embedding forms closed loops. We now define the phase value as the angle formed between the x-axis, the centre of the loop and the current state point in the embedding space. As the loop may drift around the embedding space during different stages of ventilation, we use rolling window centroids rather than the centre of the full dataset for calculating the phase angle (Fig. 5a). We unwrap the phase angles over time to obtain a continuous cumulative phase curve that increases over time. However, phase angles estimated in this way may regress in some parts of the data, adding noise to phase estimation. We therefore apply isotonic regression to the continuous phase curve to obtain a corrected, strictly increasing phase curve (Fig. 5b). It is then converted back to mod \(2\pi \) cycles, and we shift the phase so that the start of a breath is at approximately phase value 0. Finally, we identify the start of ventilator cycles by identifying when the phase curve crosses from below a threshold value to above it. This allows us to segment each valid interval into individual ventilation cycles. Algorithm 1 provides an overall description of the segmentation algorithm.
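Below is a minimal sketch of this phase-based segmentation, assuming the Pmask channel is available as a 1-D numpy array sampled at 32 Hz; the embedding length, rolling-centroid window and threshold are illustrative values rather than the exact study settings, and the final phase shift is omitted for brevity.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.isotonic import IsotonicRegression

def segment_breaths(pmask, embed_len=10, centroid_win=256, phase_threshold=0.5):
    # State-space reconstruction: 2D PCA of all length-`embed_len` subsequences.
    subseqs = np.lib.stride_tricks.sliding_window_view(pmask, embed_len)
    states = PCA(n_components=2).fit_transform(subseqs)

    # Rolling centroid of the loop, so the phase reference may drift over time.
    kernel = np.ones(centroid_win) / centroid_win
    centroids = np.stack([np.convolve(states[:, k], kernel, mode="same")
                          for k in range(2)], axis=1)

    # Phase angle between the x-axis, the rolling centroid and the current state,
    # unwrapped into a cumulative phase curve.
    rel = states - centroids
    phase = np.unwrap(np.arctan2(rel[:, 1], rel[:, 0]))

    # Isotonic regression enforces a non-decreasing cumulative phase.
    t = np.arange(len(phase))
    phase = IsotonicRegression(increasing=True).fit(t, phase).predict(t)

    # Wrap back to [0, 2*pi) and mark cycle starts at upward threshold crossings.
    wrapped = np.mod(phase, 2 * np.pi)
    starts = np.flatnonzero((wrapped[:-1] < phase_threshold) &
                            (wrapped[1:] >= phase_threshold)) + 1
    return starts  # indices of estimated breath starts
```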

We also noticed that in some study instances, the Thor and Abdo channels have reversed polarity, where the meanings of signal peaks and valleys are swapped. We identify these instances by finding negative magnitude correlations between Pmask and Thor or Abdo channels, and flip the affected channels to their correct polarity.
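As a rough illustration (a hypothetical helper using plain Pearson correlation as a stand-in for the magnitude correlation used in the study), the polarity check amounts to flipping a belt channel that is anti-correlated with mask pressure:

```python
import numpy as np

def fix_polarity(pmask, belt):
    # Flip the effort belt channel if it is negatively correlated with Pmask.
    corr = np.corrcoef(pmask, belt)[0, 1]
    return -belt if corr < 0 else belt
```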

3.3 Processed data information

In Table 2, we provide a brief summary of key information regarding the processed dataset.

Table 2 Processed dataset information

4 Methodology

Here we present our methodology in terms of the three main challenges identified in the introduction (signal representation, uncertainty modelling and label noise mitigation) and our proposed solutions, respectively: path signature features, a flow-based generative classifier, and class-conditional label noise correction.

4.1 Signal features selection

We have previously observed that common features used for sequence representation all have limitations when applied to PVA detection. But do these limitations really materialise in practice for our dataset? Do we need a combination of feature sets, or are there other feature construction methods that may suit our needs? In this subsection, we consider two strong contenders from previously discussed features, namely ventMAP’s handcrafted features and ROCKET from top-performing generic time series features. We also investigate a new contender, path signature features, which exhibit many interesting properties that are surprisingly useful for our purposes.

4.1.1 Hand-crafted features

These are the features used in the ventMAP model [2], which are primarily key point and shape descriptors of the waveforms. However, the original algorithm was intended for high-quality invasive ventilator data and only uses two of the four input channels present in our dataset.

4.1.2 Path signature features

We use path signature (specifically the log signature, a more concise representation) [34, 35] to represent both a single breath cycle and a contextual window of several breath cycles.

The path signature of a curve \({\textbf{X}}\) on [ab] is defined as [35]:

$$\begin{aligned} \begin{aligned}&Sig({\textbf{X}})_{[a, b]}=\\&\sum _{k=0}^{\infty }\int _a^b\int _a^{t_1}...\int _a^{t_{k-1}}d{\textbf{X}}_{t_1} \otimes d{\textbf{X}}_{t_{2}}...\otimes d{\textbf{X}}_{t_k} \end{aligned} \end{aligned}$$
(1)

where \(\otimes \) is the (truncated) tensor product. The signature consists of graded terms distinguished by a multi-index \({i_1 i_2 ... i_m}\), where each \(i_k \in [1, d]\cap {\mathbb {N}}\).

$$\begin{aligned} Sig({\textbf{X}})_{[a, b]}^{i_1 i_2 ... i_k}=\int_a^b\int_a^{t_1}...\int_a^{t_{k-1}}d{\textbf{X}}^{i_1}_{t_1} d{\textbf{X}}^{i_2}_{t_{2}}...d{\textbf{X}}^{i_k}_{t_k} \end{aligned}$$
(2)

The length of the multi-index is called the depth or order of the signature term. The signature is an infinite-dimensional object, but in practice we often use the truncated signature \(\Pi _m Sig({\textbf{X}})\), defined as the signature up to depth m. We can justify such truncation by the factorial decay of the signature terms with depth [35].

In simplified terms, path signatures can be seen as polynomial features for multidimensional paths. Just as we can generate from a point \((x, y)\) a series of polynomial features \(x^1, y^1, xy, x^2, y^2, x^2y, \ldots \), we can, for instance, generate from a 3D path \(X_t\) signature terms identified by the indices \(i_1, i_2, i_3, i_1 i_1, i_1 i_2, i_1 i_3, i_1 i_2 i_3, i_2 i_1, \ldots \). Just like polynomials, this process creates a rich basis that is able to (linearly) approximate non-trivial functions over paths, and which is also able to capture interactions between path dimensions.

The path signature itself contains redundant information, and some terms can be inferred from a combination of other terms through the shuffle product identity [34]:

$$\begin{aligned} S(X)^I_{a, b} S(X)^J_{a, b} = \sum _{K \in I \sqcup J} S(X)^K_{a, b} \end{aligned}$$
(3)

where \(\sqcup \) denotes the shuffle product.

The log signature allows us to compress the signature and remove such redundant information, resulting in a more concise representation. It is defined as:

$$\begin{aligned} \begin{aligned} LogSig({\textbf{X}})&:= Log(Sig({\textbf{X}})-{\textbf{1}})\\ {}&= \bigoplus _{n=1}^{\infty }\frac{(-1)^{n-1}}{n}(Sig({\textbf{X}})-{\textbf{1}})^{\otimes n} \end{aligned} \end{aligned}$$
(4)

While there exist many feature representation methods, such as summary statistics, Fourier coefficients, random convolutions [15] or end-to-end learned convolutions, most do not naturally capture the relationship between different data channels, and they usually require some form of padding or length normalisation to deal with variable-length sequence segments, which breath cycle data certainly are. Path signatures, on the other hand, naturally capture the relationship between multidimensional data and can convert arbitrary-length sequence input into comparable fixed-length vectors [35], making them an ideal feature representation for our use case. In our model, we apply both time augmentation and time-delay augmentation \((x_t) \rightarrow (t, x_t, x_{t-\Delta t})\) before the signature transform, to weaken the reparameterisation invariance of signature features and help capture time-dependent traits in the data. A benefit of using signature features is that we can summarise a contextual window around a breath cycle (e.g. 10 cycles before and after the breath of interest) into a vector of the same length as that of the single cycle, without average pooling or learned summarisation (e.g. via a recurrent neural network), which facilitates model implementation.
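A minimal sketch of this feature extraction is shown below, using the iisignature package (one of several signature libraries that could be substituted; the depth, delay and array layout are illustrative assumptions). Here `cycle` is a (T, 4) array holding one breath cycle over the Pmask, Flow, Thor and Abdo channels.

```python
import numpy as np
import iisignature

def logsig_features(cycle, depth=3, delay=4):
    t = np.linspace(0.0, 1.0, len(cycle))[:, None]    # time augmentation
    lagged = np.roll(cycle, delay, axis=0)
    lagged[:delay] = cycle[0]                          # time-delay augmentation x_{t - dt}
    path = np.hstack([t, cycle, lagged])               # (T, 1 + 4 + 4) augmented path
    prep = iisignature.prepare(path.shape[1], depth)   # cache this in real code
    return iisignature.logsig(path, prep)              # fixed-length feature vector
```

The same function applied to a 21-cycle context window returns a vector of exactly the same length, which is what makes combining single-cycle and contextual features straightforward.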

A nice property of the path signature is that continuous functions of a path can be approximated by a linear function of the signature to arbitrary precision [36]. This is because the signature terms serve as a polynomial basis for functions on the path space. This is a major theoretical justification for choosing the path signature as a generic sequence feature set. However, the number of signature terms grows exponentially with the depth (i.e. number of interactions between dimensions) at which we truncate the signature, so in practice, we often take signatures of low truncation order and apply the signature transform to smaller segments of the sequence data.

[37] recommends using signatures of sub-windows (such as dyadic windows) to capture features at finer scales instead of simply increasing signature depth, which we also consider in our feature set construction. Because our input data segments are variable-length, we construct proportionally dyadic windows from the inputs.
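The sketch below (reusing the hypothetical `logsig_features` helper from the previous snippet) illustrates proportional dyadic windows: each variable-length segment is split into halves and quarters by index proportion and the log signatures of all sub-windows are concatenated. The number of levels is illustrative, and each sub-window is assumed to contain at least two samples.

```python
import numpy as np

def dyadic_logsig(cycle, depth=3, levels=2):
    feats = [logsig_features(cycle, depth)]            # whole segment
    for level in range(1, levels + 1):
        # Split into 2, 4, ... proportional sub-windows of the variable-length input.
        pieces = np.array_split(cycle, 2 ** level, axis=0)
        feats.extend(logsig_features(p, depth) for p in pieces)
    return np.concatenate(feats)
```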

In Table 3, we see a summary of path signatures’ capabilities versus what we look for in a PVA detection feature set. While its practical effectiveness is yet to be empirically determined, it is clear that the innate properties of path signatures satisfy our requirements for PVA detection features. It is remarkable that we can simultaneously cover all four requirements with a single feature type that we can efficiently compute in one pass. We therefore hypothesise that signature features could be highly effective for describing ventilator waveforms.

Fig. 6

Construction of dyadic windows and hierarchical contextual windows for path signature features. Note that the red dyadic windows describe the prediction interval, the blue overlapping windows describe the relationship between current window with the previous/next window, and the yellow context window captures the “background state" of the signal

Table 3 Path signature features versus feature requirements for PVA detection

4.1.3 ROCKET features

The ROCKET algorithm [15] uses dilated random convolution kernels to extract translation-invariant local information from time series and has been shown to perform competitively in general time series classification tasks, with a favourable training-time-to-performance ratio against the best time series classifiers. We may treat its output either as random shapelets or as the outputs of an untrained convolutional neural network. As the random convolution features are symmetrically aggregated over time, they are invariant to sequence length, and thus make suitable features for comparing variable-length input segments. The main drawbacks of ROCKET features are that the effectiveness of a particular feature set is random, depending on the convolution kernels chosen, and that for time series data with multiple channels the number of features needed increases rapidly; moreover, as the kernels are applied to individual channels, no cross-channel information is captured.

4.1.4 Feature set composition

We use classification performance with LightGBM [38] as an approximate measure of feature effectiveness, as gradient boosting decision tree models are generally found to be efficient and widely applicable to a variety of problems and feature spaces [39]. We use Optuna [40] to search for optimised hyperparameter settings for each classification scenario. Furthermore, we obtain an optimised classification model for every combination of the three types of features.
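A hedged sketch of this evaluation setup is given below: an Optuna search over a few LightGBM hyperparameters, scored by macro average precision on a held-out validation split. The search space, metric and class handling are illustrative assumptions rather than the exact settings used in our experiments.

```python
import lightgbm as lgb
import optuna
from sklearn.metrics import average_precision_score
from sklearn.preprocessing import label_binarize

def objective(trial, X_tr, y_tr, X_va, y_va, classes=(0, 1, 2)):
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 15, 255),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100),
        "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
    }
    model = lgb.LGBMClassifier(class_weight="balanced", **params).fit(X_tr, y_tr)
    proba = model.predict_proba(X_va)
    y_bin = label_binarize(y_va, classes=list(classes))
    return average_precision_score(y_bin, proba, average="macro")

# study = optuna.create_study(direction="maximize")
# study.optimize(lambda t: objective(t, X_tr, y_tr, X_va, y_va), n_trials=100)
```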

The total dimensionality of each feature set is 16 for hand-crafted, 1410 for path signatures, and 2000 for ROCKET.

We found that while combining all three sets of features achieves the best classification performance using LightGBM, the hand-crafted features provide only limited discriminative power, and most of the performance can be achieved using path signature features alone. As we see, path signature features are highly effective for the PVA detection problem. This confirms our earlier hypothesis that the capabilities of the path signature in Table 3 make it an ideal feature choice for our problem.

4.1.5 Feature set size

We also perform a feature subset study by comparing the performance of models built using the top n features based on LightGBM’s feature importance ranking with the all-features model (Fig. 7). We found that using around 800 top features already achieves close to the best classification metrics, while the performance plateaus with as few as 200 top features. Of the top 200 features, 13 are hand-crafted features, 105 are log signature features, and 82 are ROCKET features.
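The top-n study can be sketched as follows (function and variable names are hypothetical; we use LightGBM's gain importance here, which is one of the importance types the library exposes):

```python
import numpy as np
import lightgbm as lgb

def fit_top_k(full_model, X_train, y_train, k):
    # Rank features by gain importance from a model trained on all features,
    # then refit on the k highest-ranked features only.
    importance = full_model.booster_.feature_importance(importance_type="gain")
    top_idx = np.argsort(importance)[::-1][:k]
    sub_model = lgb.LGBMClassifier(class_weight="balanced")
    sub_model.fit(X_train[:, top_idx], y_train)
    return sub_model, top_idx
```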

Fig. 7

Classification metrics when selecting top K features according to LightGBM feature importance score, using combined ROCKET, hand-crafted and path signature features

Fig. 8

Classification metrics when selecting top K features according to LightGBM feature importance score, using only path signature features

Table 4 Feature set combination comparison using LightGBM. Note that using signature features alone already achieves very similar performance as using all features

Considering the high effectiveness of path signatures, we also set up an experiment to study the optimal feature set size just for the path signature features (excluding the contextual window part) (Fig. 8). Interestingly, it takes only the first 250–300 features to achieve peak average precision while maintaining a high AUROC score. This suggests that for effective classifier learning, only a small set of features is needed to effectively characterise the observed data instances (Table 4).

4.1.6 Feature set source data

We also group the features into those based on local segments (the prediction interval and its immediate neighbours) and those based on contextual windows (here set to the span of 10 breaths before and after the prediction interval), and compare the predictive performance of the LightGBM model with and without contextual window information. The results are shown in Table 5. We see that combining local and contextual features provides a minor performance benefit to classification, even though contextual features by themselves are not helpful for accurately identifying PVA events.

Table 5 Performance of LightGBM model by feature source

4.2 Modelling method selection

Our overall goal is to build a classification model which is able to map a stream of observed inputs \(X=\{x_{0}, x_{1}, ..., x_{m}\}\) into a sequence of labels \(Y=\{y_0, y_1, ..., y_n\}\) assigned to intervals \((s_0, e_0), (s_1, e_1), ..., (s_n, e_n)\) of X. In the PVA event classification problem, X is the input multichannel signal, Y are the event class labels, and the intervals are the breath cycles. Note that the prediction intervals are neither evenly spaced nor fixed length. Instead, the prediction is over variable-length semantic segments. Generally, there are three strategies to formulate the information flow of the model:

  • Segment-wise classification: this is the simplest approach, where we treat each prediction interval as an individual unit, and build a classifier to map each segment to their corresponding classes.

  • Sequence-to-sequence (Seq2seq): this approach converts a data stream into sequences of small fixed-length windows and builds a model to map each input window to an output prediction. The advantage is that we can choose model classes such as hidden Markov models or recurrent neural networks to capture the temporal evolution of event classes. However, the output is with respect to fixed intervals rather than the given prediction intervals, requiring some form of post-processing.

  • Irregular sequence-to-sequence: some model classes, such as neural differential equations (NDEs) [41], allow us to take an input stream, evolve a hidden state continuously, and output predictions at arbitrary time points. This eliminates the main drawback of the Seq2seq model, but this class of models is also the most difficult and time-consuming to train.

Apart from implementation details, the choice of different information flow strategies also dictates what types of events the models are most capable of modelling. A segment-wise model is most appropriate for strictly per-segment events, where we may treat event labels as properties of individual time segments. A seq2seq model excels at modelling events that may last for one or more consecutive time windows (i.e. “unit time" of the sequence). An irregular sequence model has the ability to model events with arbitrary starting and ending times. Strictly speaking, PVA events consist of all three types, where autocycle and double trigger events can either be seen as happening at the offending ventilation cycles, or as an event spanning consecutive offending ventilation cycles. Ineffective effort events last for a fraction of a single breath, covering only the section where an expected ventilator activation does not happen; however, we may also treat it as happening on the current ventilation cycle. In this study, since all PVA event types can be converted to per-segment (i.e. ventilation cycle) events, segment-wise classification provides the most straightforward solution.

In addition to the information flow, we would also like our model to be able to accurately quantify the uncertainty we ought to have regarding its predictions. The uncertainty comes from three sources:

  • Uncertainty of the observed data: the input signal may contain noise and disruptive events that are irrelevant for the classification task.

  • Uncertainty of the model: we only have limited data, and cannot perfectly capture the relationship between observed data and their labels.

  • Uncertainty of the label annotation: the human annotator may provide inconsistent labels for “hard" instances, or the labelling criteria may drift from instance to instance. We leave discussions on this topic to the next subsection.

Fig. 9

Two types of uncertainty we would like to capture. The red lines represent the decision boundaries of the classifier. The blue circle is between two classes, and we are more interested in an informative class posterior probability \({\mathcal {P}}(Y|X)\), or model calibration. The orange circle is well outside the typical data distribution, and we are more interested in using the data distribution \({\mathcal {P}}(X)\) to identify it as an outlier (color figure online)

Fig. 10

GC model architecture and graphical model. The left plot illustrates the building blocks of the deep conditional flow model in our generative classifier. The right plot shows variable dependencies between observed data X, latent representation Z, true class Y and observed label class \({\hat{Y}}\)

To quantify uncertainty in prediction \({\mathcal {P}}(y|x)\), we may either directly try to account for uncertainty in a discriminative model (e.g. via conformal prediction [24]), or capture the uncertainty through the Bayes rule in a generative model. However, for our use case, we are also interested in a normality measure of the observed data \({\mathcal {P}}(x)\) and \({\mathcal {P}}(x|y)\), which tells us whether a new observed instance is something we have seen in the typical training data, and whether an instance is a typical member of a given class. Such information is relevant clinically for identifying both “known anomalies" and “unknown anomalies" in the signal. For this purpose, a generative model based on variational inference or normalising flow provides the most straightforward solution.

Combining the observations above, we believe the best option for us is to build a feature-based segment-wise generative classifier.

4.2.1 Generative classifier model

As per earlier discussion, we would like a model which can not only output class probabilities \({\mathcal {P}}(y|x)\), but also produce data probabilities \({\mathcal {P}}(x)\) and class conditional data probabilities \({\mathcal {P}}(x|y)\), for which a generative classifier is the appropriate choice. A flow-based generative classifier [7] can not only model the generative process of arbitrary data distributions, but also allows for controlled balancing between classification and generative modelling performance, enabling class labelling and data probability estimation both with and without conditioning on class.

In our model, the observed data X is a variable-length segment of the multichannel signal consisting of three consecutive breathing cycles, and the context window of X, CTX, is also a variable-length segment, consisting of 21 breathing cycles. We use the log signature as the representation for both X and CTX due to its empirical effectiveness shown earlier, and its ability to produce comparable feature sets from variable-length inputs.

The notations used in the model and overall training objective are as below, and the overall architecture and the graphical model are shown in Fig. 10.

4.2.2 Notations

  1. \({\mathcal {D}}\): Distribution of data

  2. X: Random variable (RV) for data

  3. Y: RV for true labels

  4. \({\hat{Y}}\): RV for observed labels

  5. Z: RV for latent representation of X

  6. \(x, y, {\hat{y}}, z\): an instance of X, Y, \({\hat{Y}}\), Z, respectively

  7. \(\theta \): model parameters

  8. \(f_{\theta }\): deterministic map from X to Z

  9. \(f^{-1}_{\theta }\): inverse map from Z to X

  10. \(|J_{x \rightarrow z}|\): absolute value of the Jacobian determinant of \(f_{\theta }\), from x to z, i.e. \(|det \frac{\partial z}{\partial x}|\)

4.2.3 Objective

We may express the objective of the model as maximising the following expression, interpreted as maximising the probability of observing dataset \((X, {\hat{Y}})\) under model assumptions:

$$\begin{aligned} log {\mathcal {P}}(X, {\hat{Y}} ; \theta ) \approx {\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}} log {\mathcal {P}}(x, {\hat{y}} ; \theta ) \end{aligned}$$
(5)

4.2.4 Quantities of interest

Ultimately, we would like to know \({\mathcal {P}}(Y|X, \theta )\), \({\mathcal {P}}(X|\theta )\) and \({\mathcal {P}}(X|Y, \theta )\). They correspond to the model prediction of a data instance label, the “abnormality" of an observed data instance, and the class-conditional likelihood of observing a data instance. These quantities are all important for uncertainty-aware decision-making. \({\mathcal {P}}(Y|X, \theta )\) gives us the model’s best estimates for event classes. \({\mathcal {P}}(X|\theta )\) is needed to identify instances that are so rare that we do not have sufficient evidence for classification, or that are anomalous events not accounted for by the event labels. \({\mathcal {P}}(X|Y, \theta )\), on the other hand, helps answer the hypothetical question “can this instance possibly be from class k?", which informs us about model confidence within each event class.

The loss function used for model training in the noise-free case is similar to that of [7], which we derive below.

4.2.5 Derivation of the loss function

The loss function, assuming no label error, can be expressed as:

$$\begin{aligned} \begin{aligned}&L_{clean}(X, Y) = - {\mathbb {E}}_{(x, y)\sim {\mathcal {D}}} log {\mathcal {P}}(x, y) \\ =&- {\mathbb {E}}_{(x, y)\sim {\mathcal {D}}} log {\mathcal {P}}(y|x){\mathcal {P}}(x) \\ =&- {\mathbb {E}}_{(x, y)\sim {\mathcal {D}}} log \frac{{\mathcal {P}}(x|y ){\mathcal {P}}(y)}{{\mathcal {P}}(x)}{\mathcal {P}}(z)|J_{x \rightarrow z}| \\ =&- {\mathbb {E}}_{(x, y)\sim {\mathcal {D}}} log \frac{{\mathcal {P}}(z|y)|J_{x \rightarrow z}|{\mathcal {P}}(y)}{{\mathcal {P}}(z)|J_{x \rightarrow z}|}{\mathcal {P}}(z)|J_{x \rightarrow z}| \\ =&- {\mathbb {E}}_{(x, y)\sim {\mathcal {D}}} [log {\mathcal {P}}(z|y)+ log{\mathcal {P}}(y)-log{\mathcal {P}}(z) \\&+log{\mathcal {P}}(z)+log|J_{x \rightarrow z}|] \\ \end{aligned} \end{aligned}$$
(6)

In this formulation, we are able to separate the loss terms to \({\mathcal {P}}( y | x)\) (discriminative term) and \({\mathcal {P}}(x)\) (generative term) and assign different weights to them. The discriminative loss term is responsible for forcing different class clusters apart, whereas the generative loss term ensures the model fits the observed data.
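A minimal PyTorch sketch of this decomposition is given below: given the latent code z and log-Jacobian produced by the INN, the latent prior is a Gaussian mixture with one (diagonal) component per class, the generative term is log p(x) and the discriminative term is log p(y|x), combined with a weighting coefficient (the balance coefficient \(\gamma \) discussed later). The diagonal-Gaussian parameterisation and the exact weighting scheme are illustrative assumptions in the spirit of [7], not a verbatim copy of our implementation.

```python
import math
import torch

def gmm_log_prob(z, means, log_stds):
    # log N(z | mu_k, sigma_k) for every class k; z: (B, D), means/log_stds: (K, D)
    diff = z.unsqueeze(1) - means.unsqueeze(0)                       # (B, K, D)
    return -0.5 * ((diff / log_stds.exp()) ** 2 + 2 * log_stds
                   + math.log(2 * math.pi)).sum(-1)                  # (B, K)

def clean_loss(z, log_jac, y, means, log_stds, log_prior, gamma=1.0):
    log_p_z_given_y = gmm_log_prob(z, means, log_stds)               # (B, K)
    log_p_z = torch.logsumexp(log_p_z_given_y + log_prior, dim=1)    # log p(z)
    generative = log_p_z + log_jac                                   # log p(x)
    discriminative = (log_p_z_given_y + log_prior).gather(
        1, y.unsqueeze(1)).squeeze(1) - log_p_z                      # log p(y|x)
    return -(generative + gamma * discriminative).mean()
```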

4.2.6 Conditioning on context

One drawback of the model formulation above is that we assume all subjects and all observation periods follow the same data distribution at all times. This is inconsistent with the observed high variability between waveforms from different studies. To remedy the issue, we either have to assume the mixture component parameters depend on the context of a breath, or make the data depend on the contextual window: \(x \sim {\mathcal {P}}(X|Y, CTX)\). The former is the idea behind mixture density networks [42], which require a separate proposal network for the mixture parameters. The latter can be implemented much more easily with a conditional invertible network [43], which is what we choose for our model.

If the dependency of the model on the context input is too high, e.g. \({\mathcal {P}}(X|Y, CTX) \approx {\mathcal {P}}(X|CTX)\), then the generative classifier model is no longer valid, as X would only be weakly dependent on the class Y or latent representation Z. Therefore, we must limit the information that can be passed from the context window to X or Z. We achieve this by introducing a very narrow bottleneck (i.e. a network layer with very few neurons) between the context window subnetwork and the \(Z \leftrightarrow X\) main network.

4.2.7 Backbone of the model

We choose to use a combination of neural spline flows (NSF) [44] and coupling blocks [21] to implement the bijective mapping from X to Z. NSF has an autoregressive architecture, and is one of the more expressive invertible neural network building blocks, whereas coupling block is one of the most efficient, but less expressive. By combining these two building blocks, we strive to find a balance between efficiency and modelling capacity. We implement our GC models with FrEIA [45] and Zuko [46] in PyTorch [47].
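The sketch below gives a rough FrEIA-style picture of this backbone: a small context encoder with a narrow bottleneck produces the conditioning vector, which is fed to every invertible block of the X-to-Z map. In the actual model, neural spline flow blocks (via Zuko) are interleaved with coupling blocks; the block type, layer sizes and dimensions shown here are illustrative assumptions and the exact API calls may differ between library versions.

```python
import torch.nn as nn
import FrEIA.framework as Ff
import FrEIA.modules as Fm

FEAT_DIM, CTX_DIM, BOTTLENECK = 250, 250, 4     # assumed dimensionalities

def subnet_fc(dim_in, dim_out):
    return nn.Sequential(nn.Linear(dim_in, 128), nn.ReLU(), nn.Linear(128, dim_out))

# Narrow bottleneck limits how much information the context can pass to X or Z.
context_encoder = nn.Sequential(
    nn.Linear(CTX_DIM, 64), nn.ReLU(), nn.Linear(64, BOTTLENECK))

inn = Ff.SequenceINN(FEAT_DIM)
for _ in range(6):
    inn.append(Fm.AllInOneBlock, cond=0, cond_shape=(BOTTLENECK,),
               subnet_constructor=subnet_fc, permute_soft=True)

# Forward pass (conceptually): z, log_jac = inn(x, c=[context_encoder(ctx)])
```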

4.3 Dealing with label noise

We now consider the case where the labels are affected by class-conditional noise. We assume that the observed labels only depend on the true labels and not on the data, i.e.

$$\begin{aligned} {\mathcal {P}}({\hat{Y}}|Y, X) = {\mathcal {P}}({\hat{Y}}|Y) \end{aligned}$$
(7)

This makes \({\hat{Y}}\) and X conditionally independent given Y.

Now consider the case where we can only observe \((X, {\hat{Y}})\), then the loss function for maximising observed (noisy) data probability can be expressed as:

$$\begin{aligned} \begin{aligned}&L_{noisy}(X, {\hat{Y}}, \theta ) = -{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}} log {\mathcal {P}}(x, {\hat{y}}) \\ =&-{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}} log \sum _{y} {\mathcal {P}}(x, {\hat{y}}, y) \\ =&-{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}} log \sum _{y} {\mathcal {P}}({\hat{y}}|y){\mathcal {P}}( y | x){\mathcal {P}}(x) \\ =&-{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}} log \\&\sum _{y} {\mathcal {P}}({\hat{y}}|y)\frac{{\mathcal {P}}( z | y )\cdot |J_{x \rightarrow z}| \cdot {\mathcal {P}}(y)}{{\mathcal {P}}(z)\cdot |J_{x \rightarrow z}|}{\mathcal {P}}(z)\cdot |J_{x \rightarrow z}| \\ =&-{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}} log \sum _{y} exp(log{\mathcal {P}}({\hat{y}}|y) \\&+ log{\mathcal {P}}( z | y ) + log{\mathcal {P}}(y) - log{\mathcal {P}}(z) \\&+ log{\mathcal {P}}(z) + log|J_{x \rightarrow z}|) \\ \end{aligned} \end{aligned}$$
(8)

However, this objective is difficult to optimise due to the log-sum-exp step. Unlike in \(L_{clean}\), we cannot easily separate the generative and discriminative losses, and the optimisation is numerically unstable. Instead, we may wish to optimise an upper bound of \(L_{noisy}\) via a sampling step:

$$\begin{aligned} \begin{aligned}&L_{noisy}(X, {\hat{Y}}, \theta ) = -{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}} log {\mathcal {P}}(x, {\hat{y}}) \\ =&-{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}} log \sum _{y} {\mathcal {P}}(x, {\hat{y}}, y) \\ =&-{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}} log \sum _{y} \frac{{\mathcal {P}}(y|{\hat{y}}){\mathcal {P}}({\hat{y}})}{{\mathcal {P}}(y)}{\mathcal {P}}(x, y) \\ =&-{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}} log {\mathbb {E}}_{y \sim {\mathcal {P}}(Y|{\hat{Y}})} \frac{{\mathcal {P}}({\hat{y}})}{{\mathcal {P}}(y)}{\mathcal {P}}(x, y) \\ \le&-{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}}{\mathbb {E}}_{y \sim {\mathcal {P}}(Y|{\hat{Y}})} log \frac{{\mathcal {P}}({\hat{y}})}{{\mathcal {P}}(y)}{\mathcal {P}}(x, y) \\ =&{\mathbb {E}}_{(x, {\hat{y}})\sim {\mathcal {D}}}{\mathbb {E}}_{y \sim {\mathcal {P}}(Y|{\hat{Y}})} -log {\mathcal {P}}({\hat{y}})+log{\mathcal {P}}(y)-log{\mathcal {P}}(x, y) \\ \end{aligned} \end{aligned}$$
(9)

Note that if we directly infer \(P({\hat{Y}})\) from data and assume \(P({\hat{Y}}|Y)\) is known, then we can readily obtain P(Y) and \(P(Y|{\hat{Y}})\). There are no trainable parameters in \(-log {\mathcal {P}}({\hat{y}})+log{\mathcal {P}}(y)\), so they can be dropped from the loss function. This allows us to train the model with the same loss function as in the clean label case, just with the additional sampling step \(y \sim {\mathcal {P}}(Y|{\hat{Y}})\).
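In code, the extra sampling step only needs the assumed transition matrix and the empirical distribution of observed labels, as in the following numpy sketch (names are hypothetical):

```python
import numpy as np

def posterior_true_given_observed(T, p_obs):
    # T[i, j] = P(Y_hat = j | Y = i); p_obs[j] = empirical P(Y_hat = j).
    p_true = np.linalg.solve(T.T, p_obs)              # since P(Y_hat) = T^T P(Y)
    joint = T * p_true[:, None]                       # P(Y = i, Y_hat = j)
    return joint / joint.sum(axis=0, keepdims=True)   # column j: P(Y | Y_hat = j)

def resample_labels(observed, T, p_obs, rng=np.random.default_rng(0)):
    # Draw y ~ P(Y | Y_hat) for each observed label in a training batch.
    post = posterior_true_given_observed(T, p_obs)
    return np.array([rng.choice(len(p_obs), p=post[:, j]) for j in observed])
```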

With this model, we are able to estimate the true label of a given data instance:

$$\begin{aligned} \begin{aligned} {\mathcal {P}}(y|x, {\hat{y}}) =&\frac{{\mathcal {P}}(x, {\hat{y}}|y){\mathcal {P}}(y)}{\sum _{y'}{\mathcal {P}}(x, {\hat{y}}|y'){\mathcal {P}}(y')} \\ =&\frac{{\mathcal {P}}(x|y){\mathcal {P}}({\hat{y}}|y){\mathcal {P}}(y)}{\sum _{y'}{\mathcal {P}}(x|y'){\mathcal {P}}({\hat{y}}|y'){\mathcal {P}}(y')} \\ \propto&{\mathcal {P}}(z|y){\mathcal {P}}({\hat{y}}|y){\mathcal {P}}(y) \\ \end{aligned} \end{aligned}$$
(10)

In practice, we use the loss function upper bound in Eq. 9 for model training and the exact loss in Eq. 8 for evaluation.

4.3.1 Estimating label confusion

To successfully implement our model, it is necessary to have knowledge of label confusion rates, or the label transition matrix \({\mathcal {T}}(Y)={\mathcal {P}}({\hat{Y}}|Y)\). Unfortunately this information cannot be reliably inferred from data alone. Ideally, we would need a “clean" validation set to estimate label confusion rates, but in this case we lack such a reference dataset. Instead, we combine prior assumptions with an existing reliability study to arrive at a reasonable value for \({\mathcal {P}}({\hat{Y}}|Y)\).

4.3.2 Assumptions

In the original study paper [32], the annotator performed a label consistency study by repeating the same labelling task more than two weeks apart. It was found that while double trigger/multiple trigger events were mostly consistently labelled, there were more inconsistencies in the ineffective effort (IE) class. Therefore, it is reasonable to assume that the multiple trigger (MT) class has very small confusion rates with the other classes. Moreover, our correspondence with a respiratory expert suggests it is reasonable to assume a trained annotator is more likely to miss PVA events (false negatives) than to mistakenly label normal breaths (false positives). Therefore, we restrict \({\mathcal {P}}({\hat{Y}}|Y)\) to the form in Table 6.

Table 6 Label transition matrix

This way, we only have to estimate two confusion rates, p and q, for IE \(\rightarrow \) Normal and MT \(\rightarrow \) Normal, respectively.
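Reading this restriction off the assumptions above (rows: true class, columns: observed class, in the order Normal, IE, MT), the transition matrix has only the two miss rates p and q as free parameters; the sketch below is our reading of that structure rather than a verbatim copy of Table 6.

```python
import numpy as np

def transition_matrix(p, q):
    # P(Y_hat | Y): Normal is never relabelled as an event; IE and MT events
    # are missed (scored as Normal) with probabilities p and q, respectively.
    return np.array([
        [1.0, 0.0,     0.0    ],   # true Normal
        [p,   1.0 - p, 0.0    ],   # true IE
        [q,   0.0,     1.0 - q],   # true MT
    ])
```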

4.3.3 Infer confusion rate from reliability study

In [32], the original annotator performed an intra-rater reliability test of PVA scoring by comparing the hourly asynchrony event rates of the same ten PSGs from two scoring (annotation) passes. The discrepancy between asynchrony rates for IE ranges from 0.6 to 36.2%, with an average of 17.9%. We may treat this as a conservative estimate for the confusion rate p. Similarly, the average discrepancy between asynchrony rates for MT is 4.6%, with the per-PSG value ranging from 0 to 15.9%. If we consider the total number of asynchrony events of each type detected across PSGs, then at least 1.9% of MT events and 5.7% of IE events have inconsistent labels. Note that this outcome also suggests that the confusion rate is likely patient-dependent; however, we do not have sufficient data for a per-patient confusion rate model, so we resort to the class-conditional label noise model as described in the assumptions.

Based on the above analyses, we believe it is reasonable to assume \(p \approx 5.7\%\) and \(q \approx 1.9\%\) for the noise-aware generative classifier model.

5 Experiments

In our experiments, we evaluate the classification and anomaly detection performance of the generative model on the PVA events dataset, with an emphasis on uncertainty quantification. We use the LightGBM model based on all features in the feature comparison section plus a calibrated version of the model as our baseline and compare their performance to the generative classifier model with and without label noise correction. For completeness, we also compare a version of the generative classifier model using only log signature features versus one that uses the top hand-crafted, log signature and ROCKET features.

5.1 Model evaluation

The models are evaluated on three distinct aspects: raw classification performance via F1 score, AUROC and average precision; model calibration via expected calibration error per class and conformal prediction set size versus class coverage; and anomaly/OoD detection via agreement with alternative anomaly detection schemes.

Another important aspect of classification performance is whether the model predicts asynchrony rates (events per unit time) for every subject similar to those of the annotation. A more reliable model should predict asynchrony rates that are more consistent with the human expert’s, and have smaller performance discrepancies across subjects. We measure such consistency by computing the correlation, Spearman’s R and mean absolute error between the asynchrony rates from model prediction and human annotation on the test dataset.

5.1.1 Conformal prediction set for calibration evaluation

Here, we provide some additional details on applying conformal prediction [24, 48] to assess the quality of model-provided inter-class uncertainty. Conformal prediction is a technique which takes a conformity scoring function s(xy) which assigns a real-valued score to each possible prediction, and finds a prediction set selection procedure based on a calibration dataset such that the prediction set is guaranteed to include the true outcome above a probability threshold [48].

Here, we set the scoring function to simply the softmax probability score of the true class. On the calibration set, we rank the conformity score \(s_i\) of every data instance \((x_i, {\hat{y}}_i = {\hat{f}}(x_i))\), find the \(\alpha \)-th quantile of the conformity score distribution q, and on a separate validation set, we find the prediction set \(C_j\) of each data point \(x_j\) by calculating the softmax scores of each class, and including all classes for which the score is above the threshold q. This procedure ensures that on the validation set, the prediction sets \(C_j\) will always marginally cover (i.e. unconditioned with respect to data x or class y) the true label \(y_j\) with probability at least \(1-\alpha \). We may then treat the size of the prediction sets as a measure of the amount of uncertainty in the model, and treat the class-conditional coverages as a measure of the balance of uncertainty across classes. In our experiments, we set \(\alpha =0.05\).
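A compact numpy sketch of this split conformal procedure (with the finite-sample quantile correction omitted for brevity) is:

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.05):
    # cal_probs: (n, K) softmax outputs on the calibration set.
    # Conformity score = softmax probability assigned to the true class.
    scores = cal_probs[np.arange(len(cal_labels)), cal_labels]
    return np.quantile(scores, alpha)          # the alpha-th quantile q

def prediction_sets(val_probs, q):
    # Include every class whose softmax score is above the threshold q.
    return [np.flatnonzero(p >= q) for p in val_probs]
```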

5.1.2 OoD measure evaluation

We would like to evaluate the effectiveness of the model’s \({\mathcal {P}}(X)\) estimate as a normality/OoD measure. As we lack a pre-defined notion of normality on the PVA dataset time series, we evaluate the effectiveness of this normality measure by comparing anomaly detection results from the model output against results from standard time series anomaly detection techniques. In the time series literature, discords (subsequences that are most distant from their nearest neighbours) are often used as a proxy for general anomalies; discord mining has been shown to be effective at various practical anomalous subsequence mining tasks and can be performed through the matrix profile of a time series [49]. Here, the detected discords are defined with respect to the current time series instance. In contrast, the data probability scores from the generative classifier model measure normality with respect to the modelled data distribution and are computed directly from a contextual window of interest, without reference to the full time series.
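For reference, a discord-style baseline score of this kind can be obtained with an off-the-shelf matrix profile implementation; the sketch below assumes the stumpy package, a single hypothetical flow channel and an arbitrary subsequence length.

```python
import numpy as np
import stumpy

# Hypothetical single-channel flow signal (samples) and subsequence length.
flow = np.cumsum(np.random.randn(5000))
m = 250

# Column 0 of the matrix profile holds each subsequence's distance to its
# nearest neighbour; large values indicate discords (potential anomalies).
profile = stumpy.stump(flow, m)[:, 0].astype(float)
discord_start = int(np.argmax(profile))
print(discord_start, profile[discord_start])
```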

Since we do not have ground truth anomaly labels, we design an auxiliary objective based on the fact that certain PVA and non-PVA events often correspond to more volatile behaviour in the signal and are more likely to overlap intervals which may be considered anomalous. We designate MT and arousal events as proxy indicators of abnormality and test whether we can construct a cover set around the time points with the highest anomaly scores. We construct the cover set by first finding all time indices with anomaly scores above the q-th quantile and then placing an interval of radius r around each such index. We evaluate the quality of the cover set by computing the percentage of all proxy indicators covered by it. This allows us to compare the characteristics of the \({\mathcal {P}}(X)\)-based anomaly score with baseline methods such as aggregated matrix profiles.
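The cover-set evaluation itself is straightforward to implement; the sketch below (function and variable names are ours) computes the covered fraction of proxy indicators for a given quantile cutoff and interval radius.

```python
import numpy as np

def proxy_coverage(anomaly_score: np.ndarray, proxy_idx: np.ndarray,
                   quantile: float, radius: int) -> float:
    """Fraction of proxy-anomaly time indices covered by intervals of the given
    radius placed around points whose anomaly score exceeds the quantile cutoff."""
    threshold = np.quantile(anomaly_score, quantile)
    flagged = np.where(anomaly_score >= threshold)[0]
    covered = np.zeros(len(anomaly_score), dtype=bool)
    for i in flagged:
        covered[max(0, i - radius): i + radius + 1] = True
    return covered[proxy_idx].mean()

# Hypothetical usage, comparing -log P(X) scores against an aggregated matrix profile:
# scores_gc, scores_mp: per-time-index anomaly scores; proxy_idx: MT/arousal indices.
# print(proxy_coverage(scores_gc, proxy_idx, quantile=0.95, radius=25))
```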

5.2 Hyperparameter selection

There are two important hyperparameter choices for the generative classifier model. First, we need to decide whether to set the prior distribution of the event classes to the empirical label distribution; in preliminary studies, we found that, given the large class imbalance in our dataset, an imbalanced prior class distribution degrades classification performance on the non-normal classes too severely. Second, we have to set the balance coefficient \(\gamma \) for the trade-off between the generative and discriminative losses [7]. We include a separate study in our experiments to examine its effect on the classification results as well as the quality of the latent space visualisation.
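For reference, one common way of writing such a hybrid objective is shown below; this is only a sketch, and the exact parameterisation and normalisation follow [7] and may differ from what is written here:

\[ \mathcal {L} \;=\; \underbrace{-\log {\mathcal {P}}(x)}_{\text {generative}} \;+\; \gamma \, \underbrace{\bigl (-\log {\mathcal {P}}(y \mid x)\bigr )}_{\text {discriminative}}, \]

so that \(\gamma = 0\) emphasises pure density modelling and, in the limit \(\gamma \rightarrow \infty \), the objective is dominated by the discriminative term.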

We also considered applying label smoothing to encourage class overlap when necessary, as suggested by [7], but we found it to be ineffective. We suspect that because label smoothing is roughly equivalent to assuming symmetric label noise, it affects different classes unevenly when the classes are not balanced, exaggerating the rate at which the majority classes flip to minority classes. As the PVA dataset has high class imbalance and inherent label noise, we believe label smoothing is not an appropriate strategy.

5.3 Experiment results

Fig. 11
figure 11

Example of GC prediction output

5.3.1 Classification performance

In Table 7, we see the PVA classification performance of baseline LightGBM, a pure discriminative model and generative classifier models. Overall, on pure discriminative metrics, the LightGBM and the pure discriminative (\(\gamma =\infty \)) models perform better than the generative classifier models, and the GC models with higher \(\gamma \) generally perform better on most metrics. GC models with high \(\gamma \) can get close to or surpass the LightGBM model on some classification performance metrics. By looking at the classification metrics alone, it is not clear whether label noise correction is worth it, as the GC models without label noise correction appear to have higher performance scores. However, as we shall see below, label noise correction greatly improves model uncertainty calibration.

Table 7 Run-averaged classification performance comparison, for LightGBM baseline and GC models at different \(\gamma \) values, with and without label noise correction

We also look at predictive consistency across subjects. In Table 8, we notice that GC-based models exhibit higher consistency between annotated and predicted asynchrony rates and tend to predict a similar “severity" for the PVA events as the human annotation, whereas the LightGBM model has a larger performance disparity between subjects and is more likely to predict asynchrony rates that disagree with the annotation. Indeed, the LightGBM model tends to de-prioritise learning hard instances in favour of fine-tuning performance on easier instances, resulting in larger deviations from the annotation on several of the most challenging PSG studies.

Table 8 Classification consistency across subjects comparison, through correlation and MAE scores between annotation versus model asynchrony rates per subject

5.3.2 Model calibration

For the model calibration comparison, we use the “corrected" noisy label probability instead of the estimated “true" label probability for the GC models, because we can only observe the noisy labels and must evaluate the models on consistency with those observed labels. In Table 9 and Fig. 12, we see that the GC model with noise correction achieves the best class probability calibration of all models. In contrast, the GC models without label correction exhibit poorer calibration, especially for the normal and IE classes. The baseline LightGBM model is better calibrated than most low-\(\gamma \) GC models but less so than noise-aware GC models with an ideal generative-discriminative balance.

Table 9 Run-averaged expected calibration error comparison
Fig. 12
figure 12

Calibration plot for the best GC and LightGBM models, for each of the event classes (Normal, Multiple Trigger, Ineffective Effort) from left to right. Curves above the diagonal indicate over-confidence, whereas curves below the diagonal indicate under-confidence. Methods with calibration curves closer to the diagonal are more calibrated

We may also analyse model uncertainty calibration through the lens of conformal prediction, as shown in Table 10. We see that for the same marginal coverage level (95%), GC models tend to need the same or slightly larger prediction sets than the LightGBM baseline, but achieve much higher coverage on PVA events (MT and IE labels). Likewise, we notice that with fixed marginal coverage, per-class coverage balance tends to decrease with a larger \(\gamma \) value (i.e. discriminative loss weight), indicating that it is the generative loss that contributes more to balanced model calibration.

Overall, we notice that unlike typical discriminative classifiers based on deep learning, our GC-based model is able to achieve similar or better model calibration than a well-tuned LightGBM model.

Table 10 Conformal prediction set size, class coverage and marginal coverage (\(\alpha =0.05\))

5.3.3 Data probability score and latent space visualisation

In Fig. 13, we see the percentage coverage of anomalous points at different covering interval radii and quantile cutoffs used to construct the cover set, using both the \({\mathcal {P}}(X)\) score and the breath-aggregated matrix profile (MP). Clearly, as we increase the quantile and expand the interval radius, larger portions of potentially anomalous points are covered, but we can also see that the \({\mathcal {P}}(X)\) score captures more of the anomaly indicators at smaller quantile values and narrower interval radii. This indicates that with \({\mathcal {P}}(X)\) as the anomaly score, more of the high-scoring points are close to real anomalies than when simply using the aggregated matrix profile. We believe this is mainly because the GC model integrates waveform distribution information from multiple PSGs, whereas instance-based anomaly scores such as MP only consider intra-instance abnormality. By having a generative model of X, we also capture a more complete picture of a given input x’s typicality, beyond local information such as kNN distances, on which methods like MP rely.

Fig. 13
figure 13

Coverage of anomalous points by GC probability score and matrix profile at different covering interval radii and score quantiles. Brighter colour indicates higher anomalous coverage at a given sensitivity level in the first two plots, and a greater anomalous event detection rate advantage for the generative classifier in the last plot

We can also obtain an intuitive visual representation of the latent space Z by simply projecting it onto the 2D plane defined by the three class centroids. We are fortunate in this case to have exactly three classes, but even with a larger number of classes, projecting the latent space with PCA is still straightforward and yields similar results. The projected latent space at different discriminative loss weights \(\gamma \) can be seen in Fig. 14. At lower \(\gamma \) levels, the representation more closely resembles a Gaussian mixture, class overlaps are preserved, and the disparity between training set and test set is smaller. On the other hand, with a high \(\gamma \) value, the data representation is more stretched out and the classes are pushed apart, but the clusters are also more deformed and less Gaussian, and we often end up introducing overfitting artefacts, such as clean, large-margin boundaries between classes that are visible only in the training set. Because the latent space distribution conforms more closely to the Gaussian mixture model at lower \(\gamma \), out-of-distribution detection is more effective with a higher generative loss weight.
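A sketch of this centroid-plane projection is given below (NumPy; obtaining an orthonormal basis from the two centroid difference vectors is one of several reasonable conventions, and the function name is ours).

```python
import numpy as np

def project_to_centroid_plane(Z: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Project latent vectors Z (n x d) onto the 2D plane spanned by the
    three class centroids, using an orthonormal basis of that plane."""
    centroids = np.stack([Z[labels == c].mean(axis=0) for c in np.unique(labels)])
    assert centroids.shape[0] == 3, "this projection assumes exactly three classes"
    origin = centroids[0]
    u = centroids[1] - origin
    v = centroids[2] - origin
    e1 = u / np.linalg.norm(u)
    e2 = v - (v @ e1) * e1            # Gram-Schmidt orthogonalisation
    e2 = e2 / np.linalg.norm(e2)
    return (Z - origin) @ np.stack([e1, e2], axis=1)   # n x 2 coordinates

# Usage (hypothetical): coords = project_to_centroid_plane(latent_z, y_train)
```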

Fig. 14
figure 14

Latent space visualisation for different \(\gamma \) values. Red dots indicate the means of fitted Gaussian mixture components

In Fig. 15, we look at the data probability scores (\(-\log {\mathcal {P}}(X)\)) in the latent space projection. We notice that most latent space points near the middle of the class clusters have low scores, whereas those far from the distribution of typical data tend to have high scores. Note that while the latent space representation approximately captures the class membership and typicality of the input data X, some data distribution information is encoded in \(-\log {\mathcal {P}}(X)\) only through the \(\log |J_{x \rightarrow z}|\) term in the loss function, which reflects the probability “warping" when going from the data X space to the latent Z space. This information is not directly available in the latent representation.
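For clarity, writing \(z = f(x)\) for the latent representation and \(\pi _k, \mu _k, \Sigma _k\) for the mixture weights and Gaussian parameters of the latent classes (this notation is ours), the score shown in Fig. 15 follows the standard change-of-variables decomposition for flow-based models:

\[ -\log {\mathcal {P}}(X = x) \;=\; -\log \Bigl ( \sum _k \pi _k\, \mathcal {N}\bigl (f(x);\, \mu _k, \Sigma _k\bigr ) \Bigr ) \;-\; \log \bigl |J_{x \rightarrow z}\bigr |, \]

which makes explicit that the latent position supplies only the first term, while the Jacobian term carries the remaining data distribution information.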

5.3.4 Effect of different discriminative loss weight \(\gamma \)

The studies in [7] suggest that setting different \(\gamma \) values results in a family of models with varying emphasis on classification accuracy versus data modelling accuracy. Based on our earlier observations, this is indeed the case. In Table 7, we see that classification performance in terms of F1 scores generally increases with a higher weight on the discriminative loss, but the AUROC and Average Precision scores do not follow this trend; indeed, AUROC is highest at the smallest \(\gamma \) value.

We also notice a similar pattern in the latent representation. With larger \(\gamma \), decision boundary clarity increases, but the model’s ability to capture the data distribution \({\mathcal {P}}(X)\) decreases. In terms of uncertainty quantification, this means that while the class probabilities \({\mathcal {P}}(Y|X)\) are more accurate with a higher discriminative loss weight, the uncertainties in the data likelihood \({\mathcal {P}}(X|Y)\) are better controlled with a higher generative loss weight. There exists an uncertainty quantification performance trade-off, as mentioned in [7], but we are able to fine-tune the balance as needed.

5.3.5 Summary of experiment results

Overall, in our experiments we observed that although the generative classifier model sacrifices a small amount of pure discriminative performance metrics when compared with a strong baseline—the LightGBM classifier (Table 7)—the generative classifier model is both more consistent across patients (Table 8), and better calibrated (Table 9, Fig. 12) than the baseline at the appropriate \(\gamma \) choice (\(\gamma \approx 4\)). GC models also exhibit superior conformal prediction set coverage on the PVA classes than the baseline LightGBM model (Table 10), indicating they produce more informative class probabilities. In conclusion, our GC models are able to produce data probability scores that are unavailable in purely discriminative models. Such scores, when used for anomaly detection, are able to outperform a strong and widely used baseline for sequence anomaly detection, the matrix profile (Fig. 13).

6 Discussion

6.1 Feature representation choice

We have demonstrated in this study that path signature features with dyadic windows and overlapping contextual windows are an effective feature representation for the PVA detection problem, and potentially also for similar multichannel interval classification problems. We believe the effectiveness of the signature features can be attributed to many of their advantages being directly relevant to the PVA problem and similar medical monitoring time series. Firstly, signature features can be calculated over variable-length windows, which naturally provides a consistent way to summarise prediction intervals of different lengths while keeping the features comparable. Secondly, path signature features encode interactions between data channels through the iterated integral terms, which mix information from distinct channels. In contrast, many commonly used generic time series features apply to one channel at a time, missing these important interactions. Finally, path signatures encode the order of events [35], which is particularly relevant for our objective of detecting desynchronisation between data channels. We have also seen that although the full signature feature set grows rapidly with data dimension, it is possible to select a sparse subset of the full signature and still achieve high classification performance. Considering the other advantages of the path signature, such as computational efficiency (vs. many elaborate time series features) and being fully deterministic (vs. stochastic feature sets such as ROCKET features), it is a highly competitive default choice for sequential data representation.
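As a concrete illustration of these properties, the sketch below computes the depth-2 signature of a multichannel window directly from its increments, assuming piecewise-linear interpolation between samples; in practice a dedicated package (e.g. iisignature or esig) would be used, and the function name here is ours.

```python
import numpy as np

def signature_depth2(path: np.ndarray) -> np.ndarray:
    """Depth-2 path signature of a (length x channels) window, assuming
    piecewise-linear interpolation.  Level 1 = total increment per channel;
    level 2 = iterated integrals S^{ij}, which mix pairs of channels and
    encode the order in which they move."""
    increments = np.diff(path, axis=0)                      # (L-1) x d
    level1 = increments.sum(axis=0)                         # d
    # Running value of each channel before each increment, relative to the start.
    cum_before = np.vstack([np.zeros(path.shape[1]),
                            np.cumsum(increments, axis=0)[:-1]])
    level2 = cum_before.T @ increments + 0.5 * increments.T @ increments   # d x d
    return np.concatenate([level1, level2.ravel()])

# Example: a 3-channel breath window; the feature length (d + d^2) does not
# depend on the window length, so windows of different durations stay comparable.
window = np.random.randn(120, 3).cumsum(axis=0)
print(signature_depth2(window).shape)    # (12,)
```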

Fig. 15
figure 15

\(-\log {\mathcal {P}}(X)\) values in the latent space

6.2 Advantage of the generative classifier

While most research in uncertainty quantification of classifiers focuses on the class posterior probabilities \({\mathcal {P}}(Y|X)\), one can argue that the data likelihood \({\mathcal {P}}(X|Y)\) is equally important, as there may be observed instances that do not belong to any of the pre-defined classes, cannot be classified from the observed data alone, or exhibit behaviour significantly shifted from the training set. Having access to \({\mathcal {P}}(X|Y)\) allows us to perform Out-of-Distribution (OoD) detection directly with the model. A flow-based generative classifier conveniently provides access to probability estimates in both directions. In our experiments, we have shown that the “generative" part of the model yields a reasonable estimate of the abnormality of the prediction interval without requiring instance-based similarity comparison at test time. The model encodes the relationship of a potential new observation to the instances already in the training set in the generative model \({\mathcal {P}}(X)\). We demonstrated in the experiments that using \({\mathcal {P}}(X)\) as a measure of abnormality generally produces results consistent with sliding-window-based abnormal subsequence mining.
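As a simple illustration, OoD flagging with the model’s data probability score can be as direct as thresholding it against a reference quantile of training-set scores; the quantile choice below is arbitrary and ours.

```python
import numpy as np

def ood_flags(neg_log_px: np.ndarray, train_neg_log_px: np.ndarray,
              quantile: float = 0.99) -> np.ndarray:
    """Flag instances whose -log P(x) exceeds a high quantile of the
    training-set scores, i.e. instances the generative model finds atypical."""
    threshold = np.quantile(train_neg_log_px, quantile)
    return neg_log_px > threshold

# Usage (hypothetical): flags = ood_flags(test_scores, train_scores)
```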

The generative classifier approach also provides access to an easily interpretable latent space representation, because it models the complexity of the observed data by reducing it to a simple (e.g. Gaussian mixture) distribution in which only the class memberships are encoded, while the complexity is absorbed into the learned mapping function. The latent space representation provides exactly the information we need regarding classification: class membership strengths, class overlaps, class-conditional abnormality, etc.

6.3 Label noise handling

In this study, we chose the most widely used class-conditional label noise model to capture the inherent data uncertainty due to inconsistent label annotation. Even though the modification to the noise-free model is minor, we still observed a large impact of the label transition process on the calibration quality of the final predictions and the faithfulness of the latent space representation. The label transition process allows some instances to have ambiguous class probability predictions, and the latent representations of different event classes to overlap, without incurring a large loss penalty. Without the label transition process, the model is mis-specified and cannot reliably capture the uncertainty it should have about its predictions. This suggests that there is great value in conducting a reliability study when annotating data with ambiguous classes or with high inter- or intra-annotator variance. Even if the labels themselves are not fully reliable, a reasonable estimate of the class transition matrix can greatly improve uncertainty quantification for model predictions based on the unreliable labels.

Class-conditional noise (CCN) still does not fully reflect the reality of class ambiguity in many real-world datasets, as can be seen in the annotator reliability study for the PVA dataset, where the label flipping rate changes drastically from PSG to PSG. To further improve label noise handling, we may need to take instance-dependent noise (IDN) into consideration. This is difficult in our case with only a single annotator and extreme class imbalance, but in more favourable situations, effective strategies exist for learning with IDN. In the multi-annotator case, it is possible to learn a distribution over plausible labels for each instance. With more balanced classes, it is possible to learn with CCN/IDN by modifying the loss function [27, 50], or by employing a multimodal approach [51].

6.4 Other limitations and further research

The current model does not take into account the sequential relationship between adjacent input instances, apart from the neighbourhood information provided by the signature features. However, such temporal relationships may provide additional useful information for event detection (e.g. a first occurrence of an asynchrony event may increase the likelihood of observing a repeat event soon after). Indeed, sequence-to-sequence and continuous-time models are commonly studied in similar problems such as ECG or EEG event classification. It is worth investigating whether a sequence-based model would provide sufficient benefits over the current segment-based model. Also worth noting is that the path signature features used in the current model naturally provide a way to tokenise high-frequency sequences into lower-frequency states, which can prove beneficial for both seq2seq and continuous-time models.

Another limitation of the current model is that there is only one latent distribution, shared across all PSGs. Because all patients’ data have to be generated from the same model, some patients end up with systematically lower \({\mathcal {P}}(X)\) values than others, leading to somewhat imbalanced anomaly detection sensitivities across patients. In reality, what we really want to know is \({\mathcal {P}}(X|patient)\), i.e. the normality of an observation with respect to data from that patient. For obvious reasons we may not want to build models that must be trained on and applied to a single patient only; a possible solution is therefore to make the latent distribution multi-modal per class, e.g. to use multiple mixture components per event class to represent different types of patients.

Generative modelling is an active area of research, and there have been several important developments in recent years, notably the rise of diffusion-based models [52]. It is possible to generalise the idea of diffusion-based model to generative classifiers [53], which may provide additional modelling capability to this class of models.

Conformal prediction provides a general black-box solution to uncertainty quantification in regression and discriminative classification problems; however, its application in OoD detection is limited by some models’ tendency to be “confidently incorrect" for instances far from the decision boundary. Combining a generative classifier model with conformal prediction may provide a solution to robust and theoretically sound uncertainty quantification for both the inter-class and OoD case.

7 Conclusion

In this research, we performed a thorough analysis of data representation, data modelling and uncertainty quantification methods for a patient-ventilator asynchrony detection problem. Our feature analysis points to path signatures as the superior feature set for breath segment representation. Based on the signature features, we built a latent Gaussian mixture generative classifier with noisy label correction and showed that such a model achieves classification performance similar to the baseline gradient boosting model while maintaining higher classification consistency and better model calibration. We also demonstrated that, compared to discriminative classifiers, the GC model additionally provides anomaly detection capabilities as well as a simple visualisation of the latent representation. By combining the richness of the feature representation with the data modelling capability of the GC, we are able to extract as much useful information as possible despite the multiple types of uncertainty present in the PVA data. The effectiveness of the GC model, especially with noisy label correction, suggests that generative classifiers as a model class have an advantage over discriminative models in their ability to incorporate and propagate known and estimated uncertainties, producing more reliable predictions.