1 Introduction

More than 50 years ago John West published his landmark model of pulmonary gas exchange [1], building on the work of predecessors [2]. The model is characterised by volumes of inspired gas (V) and mixed venous blood (Q) equilibrating in 10 to 100 virtual lung compartments governed by log normal distributions of alveolar ventilation and pulmonary capillary blood flow across compartmental V/Q ratios [1, 3, 4].

The multiple inert gas technique (MIGET), an investigative tool based on West’s model [5, 6], has provided mechanistic detail on impaired gas exchange. MIGET evaluations are technically challenging procedures in which six inert gases spanning a range of solubilities are infused in saline until equilibration. Plots of pulmonary retention and excretion versus gas solubility are constructed from gas chromatographic measurements and ‘transformed’ respectively into distributions of blood flow and ventilation against a logarithmic scale of V/Q ratios spread across 50 compartments [5, 7].

MIGET has identified shunt (V/Q = 0) as the dominant cause of hypoxaemia in the acute respiratory distress syndrome (ARDS) and lobar pneumonia, whereas in chronic obstructive pulmonary disease (COPD) and in some patients with COVID-19 pneumonia hypoxaemia is primarily from mixed venous equilibration in low V/Q compartments [8,9,10]. Bimodal distributions have been observed in patients with COPD, asthma [3] and ARDS [11].

Despite its ‘gold standard’ status, the complexity of MIGET has obliged clinicians to track pulmonary gas exchange via alternative indices, usually those categorized as ‘tension’ or ‘content’-based [12]. Venous admixture (VA) is the classic content-based index [13], while tension-based indices include the A–a gradient, used in APACHE risk algorithms [14], and the ratio between the arterial oxygen tension and the inspired oxygen fraction (PaO2/FiO2 ratio or PF ratio), important in ARDS diagnosis and stratification [15].

These indices show significant signal variability [16], but their greatest drawback is the limited information provided on the underlying pulmonary pathophysiology. The VA approach of Riley and Cournand [13, 17] is more informative on this aspect, but hampered by inherent over-simplification. This is because VA (V/Q = 0) is one of just two perfused compartments (V/Q = 0 and 1). All oxygen transfer deficits are corralled within VA, in other words as true shunt, leaving no ability to tease out contributions from low V/Q compartments. For clinicians this can be a crucial distinction, for example in managing COVID-19 pneumonia (see “Discussion” section) [10]. Similarly, the effects of high V/Q are incorporated in a single dead space estimate (V/Q = ∞). As a final drawback, accurate VA calculations require mixed venous blood for analysis [12].
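As an illustration of why VA depends on mixed venous sampling, the content-based and tension-based indices above can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function names are hypothetical, and the haemoglobin binding capacity (1.34 mL O2 per g) and dissolved-oxygen coefficient (0.003 mL/dL per mmHg) are commonly quoted textbook values assumed here.

```python
def o2_content(hb_g_dl, sat, po2_mmhg, binding=1.34, solubility=0.003):
    """Blood O2 content (mL O2 per dL): haemoglobin-bound plus dissolved.

    The constants are assumed textbook values, not taken from the paper.
    """
    return binding * hb_g_dl * sat + solubility * po2_mmhg


def venous_admixture(cc_o2, ca_o2, cv_o2):
    """Riley-type shunt equation: VA = (CcO2 - CaO2) / (CcO2 - CvO2)."""
    return (cc_o2 - ca_o2) / (cc_o2 - cv_o2)


def pf_ratio(pao2_mmhg, fio2):
    """PaO2/FiO2 ratio in mmHg."""
    return pao2_mmhg / fio2


# Illustrative (invented) values for a hypoxaemic patient
hb = 12.0
cc = o2_content(hb, 1.00, 250.0)  # ideal end-capillary blood, assumed fully saturated
ca = o2_content(hb, 0.90, 60.0)   # arterial blood
cv = o2_content(hb, 0.65, 35.0)   # mixed venous blood
va = venous_admixture(cc, ca, cv)
pf = pf_ratio(60.0, 0.40)
```

The mixed venous content appears in the denominator of the VA expression, so without it the index cannot be calculated; the PF ratio needs only arterial data and the FiO2, but carries no information on how the deficit is distributed.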

In part to address these shortcomings, scaled back variations on the MIGET framework have been proposed [18,19,20,21]. Prominent among these is the automatic lung parameter estimator (ALPE) [18], described as a ‘simple bedside alternative to MIGET’. ALPE has been shown to match complex MIGET calculations in experimental lung injury [22, 23], and is now finding application in clinical research [24] and as the key component of a commercial package (www.mermaidcare.com) designed for monitoring and decision support.

As in MIGET, shunt in ALPE assessments is conventionally given as a percentage of cardiac output. Unlike MIGET, however, ALPE models ‘low’ and ‘high’ V/Q mismatch as partial pressure differentials (to be distinguished from diffusion limitation) across imposed ‘partitions’ between blood and alveolar gas. Specifically, ‘low’ V/Q mismatch is represented by the fall in PO2 from alveolar gas to pulmonary end-capillary blood, and ‘high’ V/Q mismatch by the rise in PCO2 across the same interface.

We suggest that machine learning (ML) could add value in this ‘scaled back MIGET’ space [25, 26]. With data inputs close to those used by ALPE it should be possible for trained ML applications to generate detailed pulmonary assessments. These could take the form of a shunt estimate plus separate parameters defining log normal distributions of blood flow across compartmental V/Q ratios. Critical care physicians would then be provided with prompt actionable diagnostic information presented in a familiar format. Added bonuses could include shorter measurement intervals with a reduced requirement for FiO2 ‘switching’ (at present ALPE requires up to four FiO2 ‘switches’).

To investigate this possibility, we tested the following hypotheses in silico:

  (1) Trained ML applications using data normally sourced from blood gas analysis, indirect calorimetry, and cardiac output measurements can quantify pulmonary gas exchange in terms describing a multi-compartment V/Q model of pulmonary blood flow.

  (2) Consistent ML reports require measurement data at no more than two FiO2 settings.

2 Materials and methods

To test the above hypotheses, we exposed selected ML applications to simulated clinical monitoring data routinely available from blood gas analysis, indirect calorimetry, and cardiac output measurements. Scenarios were constructed with these data to represent a diverse mix of O2 consumption (VO2) and delivery, CO2 production (VCO2) and transport, hemoglobin-oxygen affinity, and respiratory and metabolic acid–base status. Paired blood gases were generated in each simulation by a 21-compartment model of pulmonary blood flow governed by three input values: shunt percentage, log standard deviation (log SD) and distributional mean (Fig. 1, for more model detail and core equations, see Supplementary Material).

Fig. 1

Graphical illustration of modelled blood flow through 20 gas exchanging compartments plus a single shunt compartment (V/Q = 0). Shunt is set at 10% of total pulmonary blood flow. Note the log normal distribution of the non-shunt pulmonary blood flow according to compartment V/Q ratios. In this example log SD = 2.0 and flow distributional V/Q mean = 0.35
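The flow distribution illustrated in Fig. 1 can be approximated with a short sketch. Several details here are assumptions rather than the model described in the Supplementary Material: the compartment V/Q range (0.005 to 100), even spacing on a logarithmic scale, the use of natural logarithms for log SD, and the treatment of the distributional mean as the centre of the distribution in log space. The function name is hypothetical.

```python
import math


def flow_distribution(shunt_frac, log_sd, vq_mean, n=20, vq_min=0.005, vq_max=100.0):
    """Return (vq_ratios, flow_fractions) for n gas-exchanging compartments.

    Compartment V/Q ratios are evenly spaced in ln(V/Q). Non-shunt flow is
    weighted by a Gaussian in ln(V/Q) centred on ln(vq_mean) with spread
    log_sd, then scaled so the compartments share (1 - shunt_frac) of flow.
    vq_min and vq_max are assumed values.
    """
    step = (math.log(vq_max) - math.log(vq_min)) / (n - 1)
    ln_vq = [math.log(vq_min) + i * step for i in range(n)]
    centre = math.log(vq_mean)
    weights = [math.exp(-0.5 * ((x - centre) / log_sd) ** 2) for x in ln_vq]
    total = sum(weights)
    flows = [(1.0 - shunt_frac) * w / total for w in weights]
    return [math.exp(x) for x in ln_vq], flows


# Parameters quoted in the Fig. 1 caption
vq, q = flow_distribution(shunt_frac=0.10, log_sd=2.0, vq_mean=0.35)
peak_vq = vq[q.index(max(q))]  # compartment receiving the largest flow fraction
```

With these inputs the 20 compartments share 90% of pulmonary blood flow and the remaining 10% is assigned to the shunt compartment (V/Q = 0).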

To make the evaluation, ML applications trained on this material were challenged with simulated monitoring data from ‘unseen’ test scenarios, the goal in each case being to back-generate the three governing model parameters of pulmonary blood flow distribution (shunt, log SD and mean). These estimates were then compared with ‘true’ model input values for the same scenarios.

Steps in this process were as follows:

  (1) Arterial blood gases were produced by the lung model at two structured settings of inspired oxygen fraction (FiO2) (see below) in response to unique input combinations of the three parameters defining model pulmonary blood flow distribution (shunt, log SD and mean, Table 1) plus one value from each of six monitoring categories (Table 2) available from blood gas analysis, indirect calorimetry, and cardiac output measurements.

    Table 1 Model defining parameters
    Table 2 Monitoring inputs with ranges

  (2) Using a Python program, 34,551 unique input combinations were built around a core set of 7500.

  (3) Model calculations were run from VBA sub-routines (Excel, Microsoft, Redmond, WA) until stable outputs were achieved for pH, PCO2, PO2 and Hb saturation in arterial and mixed venous blood and in the pulmonary end-capillary blood of each of the 20 non-shunt compartments.

  (4) For each input combination, the FiO2 generating an arterial oxygen saturation (SaO2) of 0.90 was determined by iteration, ensuring that in each case 0.21 ≤ FiO2 ≤ 0.90.

  (5) On attainment of SaO2 = 0.90, values were logged for FiO2, arterial pH, arterial PO2 (PaO2), arterial PCO2 (PaCO2), calculated PF ratio and calculated venous admixture (VA).

  (6) For the second calculation the FiO2 was increased by 0.10 and the model run again.

  (7) Values for SaO2 and calculations of VA and PF ratio were logged at this higher FiO2.

  (8) With data from SaO2 = 0.90 as baseline, changes at the higher FiO2 in SaO2 (Dsat), VA (DVA) and PF ratio (DPF) were calculated and logged.

  (9) This sequence, performed 34,551 times, generated the final dataset.
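The FiO2 search and step-up in the sequence above can be sketched as below. This is an illustration under stated assumptions only. The source says the baseline FiO2 was found "by iteration" without naming the method, so bisection is used here as a stand-in. `toy_gas` is a hypothetical placeholder for the 21-compartment model: it combines the alveolar gas equation, an arbitrary transfer factor, and the Severinghaus approximation of the oxyhaemoglobin dissociation curve. It has no venous blood, so DVA is not computed; with a full model it would be obtained in the same way as Dsat and DPF.

```python
def toy_gas(fio2, paco2=40.0, rq=0.8, pb=760.0, ph2o=47.0, transfer=0.25):
    """Hypothetical stand-in for the lung model: returns (PaO2, SaO2)."""
    p_alv = fio2 * (pb - ph2o) - paco2 / rq  # alveolar gas equation (mmHg)
    pao2 = transfer * p_alv                  # arbitrary impairment factor (assumption)
    sao2 = 1.0 / (1.0 + 23400.0 / (pao2 ** 3 + 150.0 * pao2))  # Severinghaus curve
    return pao2, sao2


def find_fio2(sao2_at, target=0.90, lo=0.21, hi=0.90, tol=1e-6):
    """Bisection for the FiO2 giving SaO2 = target; None if out of range."""
    if sao2_at(lo) > target or sao2_at(hi) < target:
        return None  # target not attainable within bounds: combination rejected
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if sao2_at(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def two_point(gas_model, step=0.10):
    """Baseline values at SaO2 = 0.90 plus changes after the FiO2 step-up."""
    fio2_1 = find_fio2(lambda f: gas_model(f)[1])
    if fio2_1 is None:
        return None
    pao2_1, sao2_1 = gas_model(fio2_1)
    pao2_2, sao2_2 = gas_model(fio2_1 + step)
    pf_1, pf_2 = pao2_1 / fio2_1, pao2_2 / (fio2_1 + step)
    return {"FiO2": fio2_1, "PaO2": pao2_1, "PF": pf_1,
            "Dsat": sao2_2 - sao2_1, "DPF": pf_2 - pf_1}


record = two_point(toy_gas)
```

Capping the baseline search at FiO2 = 0.90 guarantees that the 0.10 step-up never exceeds FiO2 = 1.00.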

2.1 ML analysis of completed dataset

  (1) After pre-processing to reduce redundancies, data rows were formatted as in Table 3 and subjected to randomization.

    Table 3 Example of pre-processed data for ML training

  (2) The randomized dataset was partitioned into sequential split fractions (70%:20%:10%) for ML training, validation and testing respectively.

  (3) The test fraction was subjected to trained ML analysis with columns containing the model-defining values of pulmonary blood flow (shunt, log SD and mean) ‘held back’ to allow blinded estimates.

  (4) Two categories of ML estimates were performed:

    (a) ‘Single-Point’ estimates were derived by ML analysis of 10 variables confined to model input and output logs for SaO2 = 0.90. Input variables were ‘CO2load’, ‘O2pull’, standard P50 (P50st) [27], base excess, BE [28], and blood haemoglobin concentration (Hb). Output variables were FiO2, arterial pH, PaCO2, PaO2, and VA (Table 3).

    (b) ‘Two-Point’ estimates were derived after inclusion of three additional variables consisting of DVA, Dsat and DPF (Table 3), all obtained from model output logs following the 0.10 FiO2 increment.
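The randomization, sequential 70%:20%:10% partition, and single-point versus two-point variable sets described above might be organised as in the following sketch. The column labels are hypothetical (only ‘CO2load’ and ‘O2pull’ are named as such in the text), as are the function names and the rounding rule, so row counts for a given dataset may differ by a few rows from those reported.

```python
import random

# Hypothetical column labels for the variables listed in the text
SINGLE_POINT = ["CO2load", "O2pull", "P50st", "BE", "Hb",
                "FiO2", "pH", "PaCO2", "PaO2", "VA"]
TWO_POINT = SINGLE_POINT + ["DVA", "Dsat", "DPF"]
TARGETS = ["shunt", "logSD", "mean"]  # held back from the test fraction


def split_rows(rows, fractions=(0.70, 0.20, 0.10), seed=0):
    """Shuffle rows, then cut sequentially into train, validation and test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * fractions[0])
    n_val = int(len(rows) * fractions[1])
    return (rows[:n_train],
            rows[n_train:n_train + n_val],
            rows[n_train + n_val:])  # remainder becomes the test fraction


def select(row, names):
    """Pick the named columns from a dict-like data row."""
    return [row[name] for name in names]


demo_rows = list(range(1000))
train, val, test = split_rows(demo_rows)
```

Blinding then amounts to passing only `select(row, SINGLE_POINT)` or `select(row, TWO_POINT)` to the trained models and comparing their outputs with `select(row, TARGETS)` afterwards.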

2.2 ML methodology

We used open-source ML algorithms implementing linear regression techniques [Supplementary Material Table 1(s)]. It became evident during the validation process that multiple simultaneous models in a ‘stacked’ or ‘ensemble’ configuration outperformed any single model. The stacking process used simple linear regression at the output layer to combine the contributions from individual models.

Model stacks were tested using ‘StackingRegressor’ from the ‘sklearn’ Python library (https://scikit-learn.org/stable/). Models were trained using correlation (‘R’ and ‘R2’) and mean absolute error (‘MAE’), and by comparing the slope of the line of best fit and the distance of its intercept from zero.
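A stacked configuration of this kind can be sketched with scikit-learn as follows. The base learners (`Ridge`, `KNeighborsRegressor`) and the synthetic data are placeholders chosen for illustration; the models actually used are listed in the Supplementary Material and are not reproduced here. `StackingRegressor` predicts a single target, so in this sketch one stack would be fitted for each of shunt, log SD and mean.

```python
import numpy as np
from sklearn.ensemble import StackingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: 400 rows, 5 features, one target (not study data)
rng = np.random.default_rng(0)
X = rng.uniform(size=(400, 5))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 1]) + 0.05 * rng.normal(size=400)

stack = StackingRegressor(
    estimators=[
        ("ridge", Ridge(alpha=1.0)),                 # placeholder base learner
        ("knn", KNeighborsRegressor(n_neighbors=5)),  # placeholder base learner
    ],
    final_estimator=LinearRegression(),  # simple linear regression at the output layer
)

stack.fit(X[:300], y[:300])   # training rows
pred = stack.predict(X[300:])  # held-out rows

r2 = r2_score(y[300:], pred)
mae = mean_absolute_error(y[300:], pred)
```

The final estimator sees only the cross-validated predictions of the base learners, which is how the output layer combines the contributions from individual models.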

See ‘Supplementary Material’ for more detail of ML methodologies employed.

2.3 Statistical analysis

Prior to analysis, the comparison data were checked for completeness, accuracy, and consistency.

Two-way (univariate) comparisons were made using standard linear regression. Post-estimation diagnostics were run on all models. Given the large size of the dataset, these included checking model residuals for normality, using both the Kolmogorov–Smirnov test and a normal probability plot, and for heteroskedasticity, using the Breusch–Pagan and Cook–Weisberg tests. For each predictor, the regression slope (β) and its p-value were tabulated along with the equation intercept and the overall R2 value.
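For reference, the tabulated regression quantities (β, intercept and R2) follow from ordinary least squares of the ML estimate on the true value. The sketch below uses invented numbers and a hypothetical function name; the analyses reported here were run in a statistics package, not with this code.

```python
def ols(x, y):
    """Simple linear regression of y on x: returns (slope, intercept, R2)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Invented example: true shunt (%) versus an estimate with small errors
true_vals = [5.0, 10.0, 15.0, 20.0, 25.0]
estimates = [5.5, 9.5, 15.5, 19.5, 25.5]
beta, intercept, r2 = ols(true_vals, estimates)
```

Perfect agreement corresponds to β = 1, intercept = 0 and R2 = 1, which is the benchmark against which the single-point and two-point estimates are judged.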

Kernel density plots and graphical Bland and Altman analyses [29] were constructed to enable visual comparisons of single-point and two-point results for each variable (shunt, log SD, and mean estimates) versus the true values.
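The summary statistics behind a Bland and Altman comparison of this type can be sketched as below. Because the difference distributions are non-normal, the sketch reports the median difference with 2.5th and 97.5th percentile bounds; this percentile approach is an assumption made for illustration, and the exact method used for Table 5 is not restated here. Function names are hypothetical.

```python
import statistics


def percentile(sorted_vals, p):
    """Linear-interpolation percentile (p in 0..100) of a pre-sorted list."""
    k = (len(sorted_vals) - 1) * p / 100.0
    lo = int(k)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (k - lo)


def bland_altman(estimates, truths):
    """Median difference (estimate minus truth) with 2.5% and 97.5% bounds."""
    diffs = sorted(e - t for e, t in zip(estimates, truths))
    return statistics.median(diffs), percentile(diffs, 2.5), percentile(diffs, 97.5)


# Invented example: 101 true values with differences from -0.50 to +0.50
truths = [float(i) for i in range(101)]
estimates = [t + (i - 50) * 0.01 for i, t in enumerate(truths)]
median_diff, lower, upper = bland_altman(estimates, truths)
```

Plotting each difference against the true value, with horizontal lines at the three returned values, gives the layout described for the Bland and Altman subplots.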

STATA™ (v17.0) was used for all analyses, with the level of significance set throughout at α = 0.05.

3 Results

From the final dataset of 34,551 data rows, 31,097 rows were allocated for ML training and validation and the remaining 3454 rows for testing.

From the 3454-row test-set, kernel density and Bland and Altman plots of single-point and two-point estimates by ML versus true values of shunt, log SD and mean are set out in Figs. 2, 3, 4, 5, 6 and 7. All distributions are non-normal. Corresponding regression data are reported in Table 4, and Bland and Altman data in Table 5.

Fig. 2

Shunt (single-point). Two subplots are illustrated. The Bland–Altman (BA) plot illustrates the 3454 points. For clarity, each point is horizontally jittered by ± 1% of the value of the independent variable. Horizontal plot lines indicate the median and 95% confidence interval for the difference (enumerated in Table 5). The kernel density estimate (KDE) plot illustrates the distribution of observations for the independent variable. The solid line is the true value of the variable with the dashed line indicating the modeled variable. Each subplot shares the same X-axis scale. Both X-axis units and the Y-axis units in the BA plot are defined by the independent variable. The Y-axis in the KDE plot is dimensionless

Fig. 3

Shunt (two-point). Description as for Fig. 2

Fig. 4

Mean (single-point). Description as for Fig. 2

Fig. 5

Mean (two-point). Description as for Fig. 2

Fig. 6

Log SD (single-point). Description as for Fig. 2

Fig. 7

Log SD (two-point). Description as for Fig. 2

Table 4 Linear regression analysis: single-point and two-point estimates of shunt, log SD and mean versus true input values
Table 5 Results for Bland–Altman plots

3.1 Two-point estimates

Two-point estimates of shunt, log SD and mean produced regression models with almost identical results (Table 4), with β ~ 1.00, intercept ~ 0.00 and R2 ~ 1.00 for each of the test-set variables. The kernel density and Bland and Altman plots confirmed close agreement with true values (Figs. 3, 5, 7; Table 5).

3.2 Single-point estimates

From Figs. 2, 4 and 6 and Tables 4 and 5, single-point estimates showed close concordance with true values but were less consistent than two-point estimates. Across the three estimate categories, regression against true values gave R2 = 0.77–0.89, β = 0.991–0.993, and intercepts = 0.009–0.334 (Table 4).

4 Discussion

Using computer simulation, we found that blinded ML analysis of monitoring data replicating diverse gas exchange scenarios, including blood gases generated by a 21-compartment V/Q model of pulmonary blood flow, could back-generate the model’s governing parameters. This was achieved with ‘stacked regressor’ ML ensembles trained and tested on blood gas, indirect calorimetry, and cardiac output data over a broad spectrum of gas exchange equilibria. In each simulation ML accurately delineated pulmonary blood flow as shunt percentage plus the key descriptors (log SD and mean) of log normal flow distributions to gas exchanging compartments according to their V/Q ratios. This is essentially pulmonary blood flow in MIGET format.

Measurements adopted for the simulation are available from current ICU monitoring devices [30]. Point of care blood gas analysis has been routine in ICU practice for decades. Indirect calorimetry is now recommended as a nutritional guide for critically ill mechanically ventilated patients [31,32,33]. Low invasive cardiac output monitoring, although not without problems [34,35,36], is mainstream in contemporary ICUs. The application of artificial intelligence in critical illness monitoring and decision support is itself no longer a novel concept [26].

The dataset to train, validate and test the ML applications was derived from systematically varied input combinations of the three model defining parameters (shunt, log SD, and mean, Table 1), linked to four direct measurements (cardiac output, VO2, VCO2, and Hb; Table 2) and two calculated parameters (BE, P50st; Table 2). To complete each scenario the model generated paired sets of arterial blood gases in response to these inputs at two structured FiO2 settings. The final dataset represented approximately 35,000 unique scenarios covering a diverse mix of O2 delivery and consumption, CO2 production and transport, hemoglobin-oxygen affinity, and respiratory and metabolic acid–base status.

ML was then able to back-generate the model-defining parameters of 3454 test scenarios in blinded fashion using only the blood gas measurements along with inherent derived values (BE, P50st, VA, PF ratios) plus cardiac output, VO2, VCO2, and the baseline FiO2. ML estimates from single-point data (recorded at baseline SaO2 = 0.90) showed sufficient concordance with true values to reflect trends in all three key model parameters. However, a second equilibration introduced a dynamic component, captured by ML via changes in VA (DVA), PF ratios (DPF) and saturation (Dsat). This two-point approach enabled high fidelity identification of all three key model descriptors (Figs. 3, 5, 7; Tables 4, 5).

The simulation was designed to emulate a practical two-step procedure in which arterial blood gas analysis with oximetry is performed with the FiO2 adjusted for SaO2 = 0.90 (using SpO2 as initial guide). This is followed by a second set of blood gases after increasing the FiO2 by 0.10. During this process, once-only measurements of cardiac output, VO2 and VCO2 are also recorded. ML then quantifies the defining parameters of the diagnostic model(s) of choice from relationships embedded in the data.

It should be possible to train ML applications in other diagnostic models such as the ALPE system, which like the approach considered here devolves to three key parameters [18, 37], in that case shunt and partial pressure gradients across modelled blood/gas ‘partitions’ representing ‘high V/Q’ and ‘low V/Q’ mismatch. It is also conceivable that larger training datasets with wider input ranges could enable accurate single-point ML reports from data ‘snapshots’ collected at any working FiO2. One further possibility for future investigation is that training sets formatted to target specific model variants, for example bimodal flow distributions [38], could extend ML reporting to these complexities.

Informative ‘on the spot’ gas exchange evaluations can facilitate management decisions, as mentioned in the Introduction. A contemporary example might be a ventilated patient with pneumonia and hypoxemia with a PF ratio < 100. To decide on a safe course of action, clinicians should be able to distinguish between two extremes of lung pathophysiology. At one extreme, the disturbed oxygenation represents a large right to left shunt in the context of low pulmonary gas volumes, typical of recruitable ARDS. At the other, pulmonary gas volumes are normal and shunt is minimal, the hypoxemia arising instead from widespread low V/Q ratios due to maldistributed lung perfusion, a situation more characteristic of COVID-19 with multiple pulmonary vascular thrombi. In the latter circumstance, recruitment maneuvers and major manipulations of positive end expiratory pressure (PEEP) would be contraindicated [10, 38]. Varying combinations of the two extremes complete the spectrum of possibilities.

Based on our simulation, ML evaluations could make these distinctions rapidly without a need for specialized imaging. Equivalent diagnostic assessment by the current ALPE system would take 10 to 15 min, involve up to four FiO2 ‘switches’, and report V/Q mismatch as partial pressure gradients [24, 37].

4.1 Some caveats

The model of pulmonary blood flow used to generate the blood gases follows the basic West model format. Several modifications and simplifications were employed. These are detailed in the Supplementary Material.

The simulation assumes error-free measurements, whereas some degree of error is intrinsic to measurements of cardiac output [36], indirect calorimetry [39], and the measured and derived elements of blood gas analysis [12]. Indirect calorimetry has increased error potential at FiO2 ≥ 0.7 or PEEP > 12, both encountered in severe respiratory failure [39]. Other risk factors include circuit leaks, bronchopleural fistulae, and possibly extracorporeal circulations.

We have not attempted a sensitivity analysis. However, it is noteworthy that ALPE, an advanced system now in service, is subject to similar error susceptibilities. ALPE evaluations require a single arterial blood gas analysis and one cardiac output measurement or estimate, along with measurements at three to five different FiO2 settings of VO2, VCO2, arterial oxygen saturation by pulse oximetry (SpO2), and end-tidal O2 and CO2 fractions [24]. Despite measurement intervals of 10–15 min with up to four FiO2 ‘switches’, any signal distortion from absorption atelectasis [40] and altered hypoxic pulmonary vasoconstriction [41] is regarded as minor [37].

Further, the MIGET gold-standard itself relies on a series of measurements and techniques all prone to error, including but not limited to cardiac output and minute ventilation measurements, collection of mixed expired gas without condensation-induced loss of dissolved gases, and gas chromatographic concentration measurements of six inert gases in both mixed expired gas and the gas phases above blood samples [4].

The low baseline arterial saturation (SaO2 = 0.90) was selected to allow a subsequent 0.10 FiO2 step-up within the bounds of FiO2 ≤ 1.00. Although SaO2 = 0.90 is at the hypoxemia threshold [12], it is considered adequate for tissue oxygenation in the absence of anemia and low cardiac output, albeit with limited supportive evidence [42]. Of historical interest, older versions of the automated ALPE system could manipulate baseline SaO2 to values as low as 0.85, if necessary using FiO2 < 0.21 [18].

Dataset shunt, log SD and mean values retained uneven distributions across their respective ranges, as illustrated by the test-set kernel density plots (Figs. 2, 3, 4, 5, 6, 7). Greater training set uniformity may have produced more consistent single-point estimations. Barriers to uniformity included the automatic rejection of input combinations for which SaO2 = 0.90 could not be attained within 0.21 ≤ FiO2 ≤ 0.90.

5 Conclusions

We conclude based on computer simulations of diverse gas exchange scenarios that trained ML applications using data sourced from blood gas analysis, indirect calorimetry, and cardiac output measurements can quantify pulmonary gas exchange in terms used to describe multi-compartment V/Q models of pulmonary blood flow. High fidelity ML reports require measurement data at no more than two FiO2 settings, subject to measurement accuracy.