Main

Modern healthcare systems generate a vast amount of high-dimensional clinical data (HDCD), such as spirograms, photoplethysmogram (PPG), electrocardiogram (ECG), computed tomography and magnetic resonance imaging, that cannot be summarized as a single binary or a continuous number (such as ‘has asthma’ or ‘height in centimeters’). HDCD provide a unique opportunity to reveal the genetic architecture of diseases and complex traits when coupled with biobank-scale genetic data1,2,3,4,5,6, but we lack statistical methods to fully use HDCD in genome-wide association studies (GWAS), as standard GWAS require the phenotype of interest to be encoded as a single scalar.

The most common method for GWAS on HDCD uses a small number of expert-defined features (EDFs) extracted from the HDCD as the target phenotypes. For example, a spirogram is a graphical representation of the results of spirometry, a widely used clinical test for lung function that measures airflow and volume over time7,8. Spirograms can be summarized into EDFs, including forced vital capacity (FVC), forced expiratory volume in the first second (FEV1), FEV1/FVC (a nonlinear function of FVC and FEV1), peak expiratory flow (PEF) and forced mid-expiratory flow (FEF25−75%)9. Spirogram EDFs are used in clinical settings to diagnose diseases such as chronic obstructive pulmonary disease (COPD)10,11. In another example, PPG measures volumetric changes in peripheral blood circulation using infrared light. Previously studied EDFs of PPG include the presence (or absence) of a notch, position of the notch, position of the peak, position of the shoulder and peak-to-peak time12,13,14,15,16. PPG EDFs have known associations with cardiovascular diseases, such as coronary heart disease12. Spirogram and PPG EDFs are heritable, and GWAS on EDFs have helped identify the genetic architecture of lung17,18,19 and circulatory function20,21,22. However, EDFs may not capture all heritable signals encoded in spirograms or PPGs; thus, GWAS on these EDFs may not exploit the full potential of these HDCD.

A simple approach to HDCD GWAS performs GWAS on each data coordinate (for example, time point or pixel). For example, previous work performed GWAS on each recorded ECG time point23. This approach is computationally expensive and has low statistical power due to the high correlation of nearby coordinates and the massive multiple-testing burden24,25. A popular alternative performs principal component analysis (PCA)26 on the HDCD and then GWAS on a subset of the PCs27. However, PCA assumes a linear relationship between the raw HDCD and the underlying biological factors of interest and does not explicitly model spatial or temporal structure. Moreover, performing GWAS on a subset of PCs may miss heritable signals, which are often small.

Machine learning (ML)-based phenotyping uses HDCD as input to a supervised ML model to predict trait labels and then performs GWAS using the model predictions as the target phenotype3,6,28. While ML-based phenotyping can augment standard GWAS on manually defined trait labels, the supervised model only learns signals related to the specific target trait. Additionally, for the common case in which the supervised model uses deep learning, many labeled examples may be required to achieve good performance.

To overcome these limitations, we developed a principled method, Representation Learning for Genetic Discovery on Low-Dimensional Embeddings (REGLE), that is computationally efficient, requires no labels and can incorporate information from EDFs if available. REGLE is based on the variational autoencoder (VAE)29 model. Although VAEs have previously been applied to metabolomics data30, the utility of VAE embeddings for GWAS, polygenic risk scores (PRSs) and downstream analyses has not been previously explored. We apply REGLE in two case studies to understand the genetic architecture of lung function from raw spirograms and circulatory function from PPG. Compared to GWAS on spirogram and PPG EDFs, our GWAS on the learned encodings recovers the most known genetic loci linked to lung and circulatory function while also detecting additional loci. PRS created from loci identified via GWAS of REGLE of spirograms improves COPD and asthma predictions. Similarly, PRSs derived from REGLE of PPG improve hypertension (HTN) and systolic blood pressure (SBP) predictions. These results indicate that REGLE successfully extracts a meaningful representation of lung function from spirograms and of circulatory function from PPG, which in turn improves genetic discovery and risk prediction.

Results

Overview of REGLE

REGLE consists of three main steps. First, we learn a nonlinear, low-dimensional, disentangled representation (that is, an encoding) of the HDCD using a VAE29 trained to compress and reconstruct HDCD (Fig. 1; Methods). Autoencoders consist of an encoder and a decoder, connected by a low-dimensional ‘bottleneck’ layer. The encoder summarizes the input data into a small set of numbers at the bottleneck layer, and the decoder reconstructs the input data from the low-dimensional summary31. VAE29 is a special type of autoencoder that introduces stochasticity in the encoder. The VAE implicitly forces the learned encodings to be relatively disentangled32, that is, the encodings have relatively uncorrelated coordinates and separable biological factors can be better captured in each coordinate. Second, we perform GWAS independently on each encoding coordinate. Third, we use PRSs from the encoding coordinates as genetic scores of general biological functions and potentially combine them to create a PRS for a disease or trait of interest (Fig. 1).
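
The stochastic encoder and the pressure toward uncorrelated coordinates can be illustrated with a minimal NumPy sketch of a generic VAE objective. This is not the REGLE implementation: the squared-error reconstruction term, the `beta` weight and all function names are assumptions chosen for illustration.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (stochastic encoder output)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent coordinates."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def vae_loss(x, x_hat, mu, log_var, beta=1.0):
    """Per-sample loss: squared reconstruction error plus beta-weighted KL term.

    Both the squared-error term and `beta` are illustrative assumptions.
    """
    recon = np.sum((x - x_hat) ** 2, axis=-1)
    return recon + beta * kl_to_standard_normal(mu, log_var)

rng = np.random.default_rng(0)
mu = np.zeros((4, 5))        # batch of 4 inputs, five latent coordinates
log_var = np.zeros((4, 5))   # unit variance for every coordinate
z = reparameterize(mu, log_var, rng)
x = rng.standard_normal((4, 1000))
loss = vae_loss(x, x, mu, log_var)   # perfect reconstruction, prior-matching posterior
```

The KL term penalizes each coordinate's posterior for departing from an independent standard normal, which is the usual explanation for why VAE latents tend to be less correlated than those of a plain autoencoder.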

Fig. 1: Overview of REGLE.

In step 1, we learn a low-dimensional embedding using a VAE, where we optionally condition the decoder on EDFs. In step 2, we perform GWAS on all learned coordinates (and EDFs if they are used). Finally, in step 3, we train a small linear model to learn weights for each latent coordinate PRS to obtain the final disease-specific PRS.

REGLE enables relevant EDFs to be optionally included in the input to the decoder of the model so that the encoder is encouraged to learn only the residual signals not represented by the EDFs (Fig. 1). This ability to incorporate prior knowledge of important data features (from users or clinicians) is a key advantage of REGLE.
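
A minimal sketch of this 'feature injection' idea follows, assuming the EDFs are simply concatenated with the latent coordinates before decoding. The linear `toy_decoder` is a hypothetical stand-in for the convolutional decoder, and the concatenation point is an assumption rather than a detail confirmed here.

```python
import numpy as np

def decoder_input(z, edfs):
    """Concatenate latent coordinates with injected EDFs along the feature axis."""
    return np.concatenate([z, edfs], axis=-1)

def toy_decoder(inputs, weights, bias):
    """Hypothetical linear stand-in for the real (convolutional) decoder."""
    return inputs @ weights + bias

rng = np.random.default_rng(0)
z = rng.standard_normal((3, 2))       # two residual latent coordinates per individual
edfs = rng.standard_normal((3, 5))    # five standardized EDFs per individual
inputs = decoder_input(z, edfs)       # shape (3, 7)
weights = rng.standard_normal((7, 1000))
curve = toy_decoder(inputs, weights, np.zeros(1000))
```

Because the decoder already receives the EDFs, the encoder gains nothing by re-encoding them, so the bottleneck capacity is spent on whatever the EDFs do not explain.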

Overview of REGLE on spirograms

We applied REGLE to obtain low-dimensional representations of spirogram curves, which we call spirogram encodings (SPINCs; Fig. 2). To construct SPINCs, we trained a convolutional VAE29 to reconstruct spirograms (Fig. 2a; Methods). In addition, we constructed another set of encodings we call residual spirogram encodings (RSPINCs) by injecting five EDFs (FEV1, FVC, FEV1/FVC, PEF and FEF25–75%) as inputs to the decoder when reconstructing flow–volume curves (Fig. 2a). We generated SPINCs and RSPINCs for all individuals (n = 351,120) in the UK Biobank (UKB)33,34 using their first-visit spirogram, excluding individuals whose spirogram failed our quality control (QC) measures (Methods). We used 80% of the individuals whose genetically inferred ancestry (GIA) is European (n = 259,692) to train the (R)SPINCs models and 20% (n = 65,266) to evaluate reconstruction performance and choose hyperparameters (Extended Data Fig. 1 and Supplementary Table 1; Methods). Using just five SPINCs (the number of common spirogram EDFs), we observed highly accurate reconstruction of the input spirograms (Fig. 2b). SPINCs consistently outperformed an equivalent number of PCs in terms of reconstruction accuracy at small latent dimensions (Fig. 2c, Supplementary Table 2 and Supplementary Note). We observed similarly accurate reconstructions using EDFs + RSPINCs and confirmed that the addition of RSPINCs improves the reconstruction quality significantly, compared to using a decoder-only model to reconstruct curves from EDFs only (Extended Data Fig. 2). We used two RSPINCs to balance the number of additional coordinates and the reconstruction accuracy. Notably, the learned representations are highly consistent when trained with multiple different initializations (Extended Data Fig. 3 and Supplementary Note).
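
The PCA baseline in this comparison can be sketched as follows: fit the components on a training set, then measure reconstruction error on held-out curves at several latent dimensions. The synthetic data and function names are assumptions for illustration only; they do not reproduce the reported results.

```python
import numpy as np

def fit_pca(train, k):
    """Return the training mean and the top-k principal directions."""
    mean = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruct(x, mean, components):
    """Project onto the components and map back to the input space."""
    scores = (x - mean) @ components.T
    return scores @ components + mean

def mse(x, x_hat):
    """Mean squared error across individuals and time points."""
    return float(np.mean((x - x_hat) ** 2))

rng = np.random.default_rng(0)
basis = rng.standard_normal((3, 50))   # synthetic curves: rank 3 plus small noise
train = rng.standard_normal((200, 3)) @ basis + 0.01 * rng.standard_normal((200, 50))
valid = rng.standard_normal((50, 3)) @ basis + 0.01 * rng.standard_normal((50, 50))
errors = {k: mse(valid, reconstruct(valid, *fit_pca(train, k))) for k in (1, 2, 3, 5)}
```

On data that are linear by construction, as here, PCA is hard to beat. The advantage reported for SPINCs at small latent dimensions is consistent with real spirograms lying near a nonlinear manifold.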

Fig. 2: Overview of REGLE on spirograms.

a, Learning SPINCs using a convolutional VAE and RSPINCs using a convolutional VAE with ‘feature injection’, for example, using EDFs. b, Reconstructing a spirogram (volume–time curve) from SPINCs (dim = 5). c, Reconstruction errors (mean squared error across time points) for reconstructed spirograms using the SPINCs model (blue) and PCA (orange) with a varying latent dimension. Both the SPINCs model and PCA are trained (or ‘fitted’) on a training set, and the reconstruction error is evaluated in a separate validation set. d, Spirograms created by RSPINCs (dim = 2) decoder using a fixed set of injected features (that is, EDFs) and varying one RSPINC coordinate while fixing the other one to be zero. Line color indicates the varying RSPINC coordinate value from low (blue) to high (yellow).

(R)SPINCs are partially interpretable

Leveraging the generative nature of REGLE models, we studied the influence of RSPINC coordinates on spirogram shape by fixing the values of EDFs (obtained from a randomly selected individual in the validation set) and varying one RSPINC coordinate while keeping the other one fixed at zero and generating the corresponding flow–volume spirograms using only the decoder portion of the RSPINCs model (Fig. 2d). A typical flow–volume spirogram consists of the following two distinct parts: a relatively brief part to reach peak flow where the flow increases monotonically as the volume increases, and the main part of the spirogram where the flow decreases monotonically. In Fig. 2d, we clearly observed that varying the first coordinate of RSPINCs amounts to widening or narrowing of the second part (negative slope) while keeping the first part relatively fixed. Similarly, varying the second coordinate of RSPINCs widens or narrows the first part (positive slope) while keeping the second part relatively fixed. Notably, when varying either coordinate, the maximum flow value (PEF) and the final volume value (FVC) stay roughly the same, as expected because all EDFs were fixed.
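
This kind of latent traversal can be sketched as below. The `toy_decoder` is entirely hypothetical (an arbitrary smooth function, not the trained RSPINCs decoder); only the traversal logic, sweeping one coordinate while holding the EDFs and the other coordinate fixed, mirrors the procedure described above.

```python
import numpy as np

def traverse(decoder, edfs, n_latent, coord, values):
    """Decode curves while sweeping one latent coordinate; the others stay at zero."""
    curves = []
    for v in values:
        z = np.zeros(n_latent)
        z[coord] = v
        curves.append(decoder(z, edfs))
    return np.stack(curves)

def toy_decoder(z, edfs):
    """Hypothetical decoder: a baseline curve set by the EDFs, reshaped by z."""
    t = np.linspace(0.0, 1.0, 100)
    return edfs[0] * t * (1 - t) + z[0] * np.sin(np.pi * t) ** 2 + z[1] * t * (1 - t) ** 2

edfs = np.array([4.0])   # fixed injected features, e.g. taken from one individual
values = np.linspace(-2.0, 2.0, 5)
curves = traverse(toy_decoder, edfs, n_latent=2, coord=0, values=values)
```

Plotting the rows of `curves` with a color scale over `values` gives a panel in the style of Fig. 2d.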

Overview of REGLE on PPGs

We applied REGLE to obtain low-dimensional representations of PPG curves computed from a median single heartbeat, which we call PPG encodings (PLENCs; Fig. 3). To construct PLENCs, we trained a convolutional VAE29 to reconstruct PPGs (Fig. 3a; Methods). We generated PLENCs for all individuals (n = 170,714) in UKB33 using their first-visit PPG, excluding individuals whose PPG failed our QC measures (Methods). We used 80% of the European GIA individuals (n = 136,239) to train the PLENCs models and 20% (n = 34,475) to evaluate the reconstruction performance and choose hyperparameters (Extended Data Fig. 4 and Supplementary Table 1; Methods). With just five PLENCs (the number of PPG EDFs), we observed a highly accurate reconstruction of the input PPG (Fig. 3b and Supplementary Table 3). PLENCs consistently outperformed PCs in terms of reconstruction accuracy at small latent dimensions (Fig. 3c and Supplementary Note). We also constructed residual PPG encodings (RPLENCs) by injecting five PPG EDFs (absence of notch, position of notch, position of peak, position of shoulder and peak-to-peak time).

Fig. 3: Overview of REGLE on PPG.

a, Learning PLENCs using a convolutional VAE. b, Reconstructing PPG from PLENCs (dim = 5). c, Reconstruction errors (mean squared error across time points) for reconstructed PPGs using the PLENCs model (blue) and PCA (orange) with a varying latent dimension. Both the PLENCs model and PCA are trained (or ‘fitted’) on a training set, and the reconstruction error is evaluated in a separate validation set.

(R)SPINCs and (R)PLENCs encode information beyond EDFs

Some SPINCs and PLENCs are highly correlated with known EDFs (Pearson correlation r between SPINC3 and FVC is 0.96; r between PLENC3 and position of the shoulder is 0.74; Extended Data Figs. 5 and 6), while both RSPINCs coordinates have low correlation (∣r∣ < 0.3) with EDFs as expected (Extended Data Fig. 5). (R)SPINCs and (R)PLENCs are also correlated with other predictors of lung function (covariates), such as age, sex, height, body mass index and smoking status (Extended Data Fig. 5).

We residualized both the EDFs and the covariates from (R)SPINCs and (R)PLENCs and computed correlation with tabular UKB features (UKB phenotypes whose types are a real number, integer, date, binary or categorical). Multiple groups of fields strongly and significantly correlated with the (R)SPINCs and (R)PLENCs even after residualizing (Supplementary Tables 4–8 and Supplementary Note). Both (R)SPINCs and (R)PLENCs were associated with overall survival. For example, SPINC3 had a hazard ratio of 0.68 (95% confidence interval (CI), 0.65–0.71; P = 1.6 × 10^−83 under the Cox proportional hazards model), implying the hazard of death decreased by 32% per one s.d. increase in the coordinate (Supplementary Note, Extended Data Fig. 7, Supplementary Figs. 1–3 and Supplementary Table 9; Methods).
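
Residualization of this kind is commonly done by ordinary least squares: regress each encoding coordinate on the EDFs and covariates, then keep the residual. The sketch below assumes a linear model with an intercept; the exact procedure used in the study is described in its Methods and may differ.

```python
import numpy as np

def residualize(y, covariates):
    """Return y minus its least-squares fit on the covariates (with intercept)."""
    design = np.column_stack([np.ones(len(y)), covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return y - design @ coef

rng = np.random.default_rng(0)
covs = rng.standard_normal((500, 3))   # synthetic stand-ins, e.g. age, height, one EDF
encoding = covs @ np.array([0.5, -1.0, 2.0]) + rng.standard_normal(500)
resid = residualize(encoding, covs)
corr = [np.corrcoef(resid, covs[:, j])[0, 1] for j in range(3)]
```

By construction the residual is uncorrelated with every regressor, so any remaining correlation with other phenotypes reflects information beyond the EDFs and covariates, at least up to linear effects.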

REGLE detects new loci for lung and circulatory functions

We generated SPINCs (dim = 5), RSPINCs (dim = 2, in addition to five EDFs) for all individuals with valid first-visit spirograms in UKB (Extended Data Fig. 1 and Supplementary Figs. 4 and 5; Methods) and PLENCs (dim = 5), RPLENCs (dim = 2, in addition to five EDFs) for all individuals with valid first-visit PPGs in UKB (Extended Data Fig. 4; Methods). We then performed GWAS on all European GIA individuals across all encoding coordinates, five spirogram EDFs and five PPG EDFs using BOLT-LMM35,36, adjusting for covariates (Supplementary Note and Supplementary Figs. 6–19; Methods). (R)SPINCs and (R)PLENCs have significant SNP heritability (Supplementary Table 10 and Supplementary Note; Methods), indicating the presence of genetic signals not captured by the EDFs (Supplementary Table 10). Furthermore, SPINCs and PLENCs GWAS have higher power (measured by expected chi-square statistics) compared to PCA GWAS35,36 (Supplementary Tables 11 and 12; Methods).

GWAS on five SPINCs detected 575 independent genome-wide significant (GWS) loci (r^2 ≤ 0.1 and P ≤ 5 × 10^−8) after merging hits within 250 kb together (Table 1; Methods). Most GWS loci from SPINCs and EDFs + RSPINCs recover previously known loci19 (89% for SPINCs and 90% for EDFs + RSPINCs). SPINCs discovered more previously unknown GWS loci (65 of 575, 11%) than EDFs or PCA (Table 1 and Supplementary Note). We observed similarly superior (R)SPINCs performance when compared to a baseline model of nonlinear cubic spline coefficients instead of linear PCs (Supplementary Table 13) and when excluding UKB samples from ref. 19 (Supplementary Table 14). Functional enrichment analysis with GARFIELD37 shows that these loci are enriched for lung tissue DNase I hypersensitive sites (Supplementary Figs. 20–26 and Supplementary Note) and the EDFs + RSPINCs loci show stronger ontology term enrichments than EDFs loci alone (Extended Data Fig. 8) using GREAT38. We performed multiple analyses to ensure that these previously unknown loci were not detected by EDFs or previous work (Supplementary Note and Supplementary Tables 15 and 16).
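
Merging nearby hits into loci can be sketched with a simple greedy pass over sorted positions. The rule used here (extend a locus whenever the next hit lies within 250 kb of the previous hit on the same chromosome) is an assumption; the exact merging rule is specified in the study's Methods.

```python
def merge_hits(hits, window=250_000):
    """Greedily merge (chrom, pos) hits whose gap to the previous hit is <= window.

    Returns a list of (chrom, start, end) loci. The gap-based rule is an
    illustrative assumption, not necessarily the rule used in the study.
    """
    loci = []
    for chrom, pos in sorted(hits):
        if loci and loci[-1][0] == chrom and pos - loci[-1][2] <= window:
            loci[-1] = (chrom, loci[-1][1], pos)   # extend the current locus
        else:
            loci.append((chrom, pos, pos))         # start a new locus
    return loci

hits = [(1, 1_000_000), (1, 1_200_000), (1, 1_400_000), (1, 2_000_000), (2, 1_100_000)]
loci = merge_hits(hits)
# loci == [(1, 1000000, 1400000), (1, 2000000, 2000000), (2, 1100000, 1100000)]
```

When hits from several coordinate GWAS are pooled before merging, a locus detected by more than one coordinate is counted once.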

Table 1 Comparison of GWAS significant loci

GWAS on five PLENCs detected 90 independent GWS loci (Table 1; Methods). We compared our PLENCs GWS loci to all cardiovascular function-related loci from the GWAS Catalog39 (Methods; 520 known independent loci) and GWAS on PPG EDFs. Of the 90 GWS PLENCs loci, 50 (56%) were not previously known (Table 1 and Supplementary Table 17). Functional enrichment analysis showed that PLENCs GWS loci are enriched for fetal heart, heart and blood vessel tissue DNase I hypersensitive sites (Supplementary Figs. 27–33 and Supplementary Note).

(R)SPINCs improve asthma and COPD PRS over EDFs in UKB

We computed PRSs using BOLT-LMM35,36 effect sizes for five SPINC and two RSPINC coordinates, in addition to five spirogram EDFs. We treated these sets of PRSs as intermediate genetic scores for lung function. Given a specific trait, a set of such intermediate PRSs and a (small) set of individuals for whom the trait status is available, one can combine the intermediate PRSs into a single trait-specific PRS via a weighted linear sum of the intermediate PRSs. We created disease-specific PRSs for asthma and COPD from the following three sets of intermediate PRSs: (1) five EDFs, (2) five SPINCs and (3) five EDFs plus two RSPINCs. We learned the disease-specific PRS weights within the modeling set (n = 324,958) of European GIA individuals in UKB using medical-record-based asthma and COPD statuses. To evaluate the performance of each disease-specific PRS, we computed the accuracy of the PRS in a completely separate set of individuals from the European GIA (n = 110,722) not previously used for model training or GWAS.
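
The weighted linear sum can be sketched as follows, assuming the weights are fit by ordinary least squares on a binary disease label. The study describes only 'a small linear model', so the fitting method, the synthetic data and the function names here are illustrative assumptions.

```python
import numpy as np

def fit_prs_weights(intermediate_prs, trait):
    """Fit one weight per intermediate PRS by least squares (intercept dropped)."""
    design = np.column_stack([np.ones(len(trait)), intermediate_prs])
    coef, *_ = np.linalg.lstsq(design, trait, rcond=None)
    return coef[1:]

def combine_prs(intermediate_prs, weights):
    """Trait-specific PRS as a weighted sum of the intermediate PRSs."""
    return intermediate_prs @ weights

rng = np.random.default_rng(0)
prs = rng.standard_normal((2000, 5))    # synthetic stand-in for five coordinate PRSs
true_w = np.array([0.3, 0.0, -0.2, 0.1, 0.0])
liability = prs @ true_w + rng.standard_normal(2000)
trait = (liability > np.quantile(liability, 0.9)).astype(float)   # about 10% cases
train, test = slice(0, 1000), slice(1000, 2000)
weights = fit_prs_weights(prs[train], trait[train])
score = combine_prs(prs[test], weights)
r = np.corrcoef(score, trait[test])[0, 1]   # evaluate on held-out individuals
```

Only five (or seven) weights are learned at this stage; the per-variant effect sizes inside each intermediate PRS stay fixed.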

We observed that the SPINC asthma PRS stratifies the risk groups more effectively than the EDF PRS on both ends of the risk spectrum (Fig. 4 and Supplementary Table 18). In addition, we observed statistically significant improvements in area under the receiver operating characteristic curve (AUC-ROC), area under the precision-recall curve (AUC-PR) and Pearson correlation using the SPINC PRS (Supplementary Table 18). We observed the same trend for COPD (Fig. 4 and Supplementary Table 19). Furthermore, we observed that the EDF + RSPINC PRS significantly outperforms the EDF PRS on almost all metrics for both asthma and COPD (Fig. 4 and Supplementary Tables 18 and 19). We observed that the SPINC COPD PRS outperforms the FEV1/FVC PRS (Supplementary Table 19) for predicting medical-record-based COPD, despite FEV1/FVC having been shown to be one of the best phenotypes for generating a COPD PRS, even outperforming a PRS created from a GWAS of COPD directly6. Finally, we observed that for both diseases, the SPINC and EDF + RSPINC PRSs outperform the PRS generated by baseline methods such as PCA (Supplementary Tables 18 and 19). These results provide further evidence that SPINCs capture more genetic determinants of lung function related to asthma and COPD than the same number of EDFs, and RSPINCs capture additional genetic factors not captured by the EDFs.

Fig. 4: PRS using SPINCs and RSPINCs in UKB.

Combined PRS for medical-record-based asthma and COPD using the following three sets of intermediate PRS: five EDFs, five SPINCs and five EDFs + two RSPINCs. Each set of PRS is combined by a linear model trained using the target phenotype labels, and the prevalence of the phenotypes in the top and bottom 5%, 10% and 20% PRS individuals is evaluated in a separate evaluation set. Vertical line segments indicate 95% CIs generated by bootstrapping (300 repetitions), and the center points are the bootstrapping means. The horizontal dashed lines show the total prevalence. Star symbols indicate a statistically significant difference between the two methods using paired bootstrapping (300 repetitions) with 95% confidence (that is, two-sided P < 0.05). Lower is better for the bottom percentiles; higher is better for the top percentiles.
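
The evaluation described in this caption can be sketched as below: compute prevalence in the top PRS percentile, then compare two PRSs by resampling the same individuals for both. Declaring significance when the 95% interval of the paired difference excludes zero is an assumption about how the paired bootstrap was applied, and the data are synthetic.

```python
import numpy as np

def top_prevalence(score, label, top_frac):
    """Prevalence of the label among the top `top_frac` fraction of scores."""
    k = max(1, int(round(top_frac * len(score))))
    top = np.argsort(score)[-k:]
    return float(label[top].mean())

def paired_bootstrap(score_a, score_b, label, top_frac, n_boot=300, seed=0):
    """Bootstrap the difference in top-percentile prevalence between two scores."""
    rng = np.random.default_rng(seed)
    n = len(label)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # the same resample is used for both scores
        diffs[b] = (top_prevalence(score_a[idx], label[idx], top_frac)
                    - top_prevalence(score_b[idx], label[idx], top_frac))
    lo, hi = np.percentile(diffs, [2.5, 97.5])
    # Assumed criterion: significant if the 95% interval excludes zero.
    return float(diffs.mean()), (float(lo), float(hi)), not (lo <= 0.0 <= hi)

rng = np.random.default_rng(1)
label = (rng.random(2000) < 0.1).astype(float)
score_a = label + rng.standard_normal(2000)   # informative synthetic score
score_b = rng.standard_normal(2000)           # uninformative synthetic score
mean_diff, ci, significant = paired_bootstrap(score_a, score_b, label, top_frac=0.05)
```

Pairing matters because both PRSs are evaluated on the same individuals: resampling them jointly removes shared sampling noise from the comparison.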

We then explored whether disease-specific weights could be learned from a subset of the training data. For both asthma and COPD, the (R)SPINC-based PRS fit with as few as 100 disease cases performed indistinguishably from those trained on the full training data (Fig. 1 (step 3) and Extended Data Fig. 9). Finally, we evaluated PRS generated by GWAS with a cohort-level phenotype adjustment using inverse-normal transformation40. While we observed fewer significant differences in this case, SPINCs and EDFs + RSPINCs maintained statistically significant improvement for asthma (Supplementary Fig. 34).

(R)SPINC PRS transferred to multiple datasets and ancestry

To test the generalizability of our (R)SPINC PRSs to individuals outside the UKB and those of non-European GIA, we transferred our asthma and COPD PRSs to the Genetic Epidemiology of COPD (COPDGene)41, eMERGE III (dbGaP accession phs001584.v2.p2), European Prospective Investigation into Cancer in Norfolk (EPIC-Norfolk)42 and Indiana Biobank datasets43 (Supplementary Table 20).

For COPDGene, we observed that the SPINC PRS outperforms the EDF PRS on all four evaluation metrics for COPD. In the ‘non-Hispanic white’ subset (n = 6,576), differences in all four metrics were statistically significant (Fig. 5a and Supplementary Table 21). In the ‘African American’ subset (n = 3,140), differences were statistically significant for AUC-ROC and Pearson correlation (Supplementary Table 21). The EDF + RSPINC PRS significantly outperformed the EDF PRS in AUC-ROC and Pearson correlation in ‘non-Hispanic white’, but did not in the ‘African American’ subset (Supplementary Table 21 and Supplementary Note).

Fig. 5: SPINC PRS transferred to multiple independent datasets.

SPINC PRSs (blue) for COPD and asthma generated on UKB are transferred to the following four independent datasets and evaluated against EDF PRSs (orange): COPDGene, eMERGE III, EPIC-Norfolk and Indiana Biobank. a, PRS evaluation in COPDGene dataset on COPD. b, PRS evaluation in eMERGE III dataset on asthma. c, PRS evaluation in EPIC-Norfolk study on COPD and asthma. d, PRS evaluation in Indiana Biobank on COPD and asthma. In all figures, solid vertical intervals represent 95% CIs generated by bootstrapping (300 repetitions), and the center points are the bootstrapping means. The horizontal dashed lines show the total prevalence in the evaluation set. Star symbols indicate a statistically significant difference between the two methods using paired bootstrapping (300 repetitions) with 95% confidence (that is, two-sided P < 0.05).

We also transferred the UKB PRSs to eMERGE III (‘white’ subset, n = 8,288), EPIC-Norfolk (self-reported ‘white’, n = 21,010) and the Indiana Biobank (mostly European GIA, n = 5,254; Methods). We evaluated asthma in eMERGE III, and both asthma and COPD in EPIC-Norfolk and the Indiana Biobank. We observed consistent improvement from using SPINC PRSs over EDF PRSs for both COPD and asthma phenotypes for top-percentile prevalences, AUC-ROC and AUC-PR. The improvement was statistically significant for AUC-PR and the top 1% and 5% prevalence in eMERGE III and for AUC-ROC and AUC-PR in EPIC-Norfolk (Fig. 5b–d).

PLENCs improve hypertension and blood pressure PRS over EDFs

We computed PRSs for the five PLENCs and two RPLENCs plus five PPG EDFs and then used these sets of PRSs as intermediates for constructing cardiovascular function PRSs. We created trait-specific PRSs for HTN and SBP using the REGLE framework (Fig. 1 and Supplementary Table 22). We evaluated HTN and SBP PRSs generated by PLENCs and PPG EDFs in independent datasets (COPDGene, eMERGE III and EPIC-Norfolk) in addition to the held-out UKB test set. We did not evaluate cardiovascular PRSs in Indiana Biobank due to the unusually high prevalence of HTN (more than 80%) and blood pressure medication usage by a majority of its population.

We observed a consistent trend of improvement from using PLENC PRSs over EDF PRSs for both HTN and SBP, except for HTN AUC-ROC in EPIC-Norfolk (Fig. 6). Notably, the PLENC PRS for SBP outperformed the EDF PRS for all datasets for both correlation metrics, for example, 2× higher Pearson correlation (6% versus 3%) in the UKB test set (Supplementary Tables 23 and 24), and the differences were statistically significant in three of four datasets.

Fig. 6: PLENC PRS generated in UKB evaluated in multiple independent datasets.

PLENC PRSs (blue) for HTN and SBP generated on UKB are evaluated in the following four independent datasets against EDF PRSs (orange): UKB, COPDGene, eMERGE III and EPIC-Norfolk. a, PRS evaluation in UKB, evaluated in a separate test set not used for GWAS. b, PRS evaluation in COPDGene dataset. c, PRS evaluation in eMERGE III. d, PRS evaluation in EPIC-Norfolk. In all figures, solid vertical intervals represent 95% CIs generated by bootstrapping (300 repetitions), and the center points are the bootstrapping means. The horizontal dashed lines show the total prevalence in the evaluation set. Star symbols indicate a statistically significant difference between the two methods using paired bootstrapping (300 repetitions) with 95% confidence (that is, two-sided P < 0.05).

High association between REGLE encodings and UKB PRSs

We associated (R)SPINCs and (R)PLENCs with PRSs of 7,145 phenotypes computed by the Pan-UKB consortium (Supplementary Note; Methods). The (R)SPINC PRSs showed a strong correlation with traits previously associated with alterations in lung function, for example, systemic lupus erythematosus44,45, thyroid dysfunction46 and gluten-free diet47 (Supplementary Tables 25–28). (R)PLENCs exhibited significant correlations with different traits, including blood traits, PPG traits, ECG traits, blood pressure and cardiovascular problems (Supplementary Tables 29–32 and Supplementary Note; Methods).

Discussion

Large biobanks provide unique opportunities to identify the genetic factors underlying complex traits and diseases, but accurate phenotyping48 remains a core challenge. We proposed a general unsupervised deep learning method, REGLE, to improve genetic discovery for HDCD. We showcased the effectiveness of REGLE for generating encodings of spirograms and PPGs. These are HDCD which, in addition to being routinely measured in clinical settings, can also be measured passively and noninvasively via smartphones. In fact, PPGs are widely collected by popular wearable devices. We demonstrated that the REGLE encodings are both partially interpretable and effective for identifying genetic variants associated with lung and circulatory functions.

Unsupervised learning of HDCD representations for genomic discovery is attractive owing to the difficulty of manually acquiring EDFs at scale. Previous work has explored applying transfer learning49 and contrastive learning50 to retinal fundus images, or multimodal autoencoders to cardiac data modalities51. A key strength of REGLE is the use of a VAE to generate low-dimensional, nonlinear, disentangled representations. The ability to generate nonlinear representations is desirable for the data applications considered, as the spirogram (Fig. 2) and PPG (Fig. 3) curves seemingly lie close to a low-dimensional manifold and yet are clearly nonlinear. Moreover, VAEs have the following two main advantages over traditional autoencoders: (1) the coordinates of the latent representation are minimally correlated (Extended Data Fig. 5), encouraging them to represent separable biology and increasing power for genetic discovery and PRS (Supplementary Table 33), and (2) the learned representations are stable up to changes in signs or order, which do not affect genetic discovery (Supplementary Note and Extended Data Fig. 3).

To support the principled use of EDFs in modeling, REGLE supports a modification of the VAE in which EDFs are additionally supplied as input to the decoder, implicitly encouraging the encoder to learn features not captured by the EDFs (see Supplementary Note for connection and difference with conditional VAE52). Although these models have slightly lower GWAS power and PRS performance compared to nonresidual REGLE (Table 1 and Supplementary Tables 14, 18, 19, 21, 23 and 24), the residual models are intended for capturing variation in HDCD, which is not well-represented by existing EDFs. For example, one of our RSPINCs captures a property of spirometry curves that pulmonologists refer to as ‘coving’, an indicator of airway obstruction that is not well-represented by the standard EDFs. Moreover, we identified genetic loci associated with this RSPINC (Supplementary Table 16), which may shed light on the mechanisms behind the type of obstruction.

The improved performance of SPINC, EDF + RSPINC, PLENC and EDF + RPLENC PRSs over EDF PRSs provides evidence for the presence of disease-relevant genetic information in HDCD not captured by existing EDFs. Moreover, we developed a label-efficient approach for combining PRSs from GWAS on several learned coordinates. In particular, each coordinate PRS retains its original effect sizes, and a disease-specific PRS is constructed as a learned weighted sum of the handful (that is, five or seven) coordinate PRSs. Because only a minimal number of weights require learning during disease specialization, our premade lung and circulatory system function PRSs can be adapted for risk prediction in new settings with very few disease labels. We hypothesize that unsupervised quantification of other organ systems may be similarly beneficial for improving polygenic prediction across a wealth of diseases. Finally, in cases where labeled data are plentiful, we note that PRS performance can be further improved by jointly estimating disease-specific variant effect sizes across the set of variants associated with our latent coordinates.

There are several limitations to this work. First, we did not directly optimize multiple GWASs for new genomic discovery but used a straightforward (conservative) method to define and merge independent associated loci. A possible extension would be to combine the signals from multiple (R)SPINC and (R)PLENC coordinate GWAS27. Second, the VAE objective and, in particular, the reconstruction error are not necessarily optimal for genetic analyses, and explicitly incorporating an objective to maximize the heritability of the learned representation may be a fruitful line of future research53. Third, we did not fully optimize model architecture and training strategies specifically for genomic discovery (Supplementary Note). Fourth, we generated individual-level spirogram representations from the first measurement, despite some individuals having up to three acceptable blows. Integrating all acceptable blows from an individual could produce a more comprehensive representation of their lung function54. Fifth, REGLE was trained on spirograms and PPG obtained from the UKB only; thus, (R)SPINCs and (R)PLENCs representations may not generalize well to other datasets. One needs additional datasets with the same data modality to investigate the generalizability of the encodings. Finally, model training was performed exclusively on individuals of European GIA. While PRS evaluation was performed on multiple datasets and ancestries, the impact of ancestry-specific model training was not explored.

Despite these limitations, REGLE provides a mechanism for identifying genetic influences on organ function in the absence of labeled data and naturally accommodates expert features in the model. It also provides a method to create disease/trait-specific PRS with very few labels (that is, on the order of hundreds). As biobanks with rich imaging, activity monitoring, medical records and paired genetic data continue to grow, we anticipate that this or similar methods will be increasingly used to further elucidate the genetic underpinnings of human traits and diseases.

Methods

All relevant ethical guidelines have been followed for this research, and any necessary institutional review board (IRB) and/or ethics committee approvals have been obtained. Advarra IRB (Columbia, MD) waived ethical approval for this work involving de-identified medical imagery and metadata under 45 Code of Federal Regulations 46. Work related to genomics data was additionally reviewed by the respective data sources—UKB, COPDGene, eMERGE III, EPIC-Norfolk and Indiana Biobank. This research has been conducted using the UKB resource under application 65275.

UKB data preparation for spirograms

Spirograms from UKB were sourced from the data field 3066, which contains the volume in milliliters of exhalation at 10-ms intervals (volume–time curve), and were preprocessed closely following the procedures in ref. 6. To generate flow–time curves, we approximated the first derivative of volume with respect to time by taking a finite difference in the volume–time curves. We normalized the volume–time and flow–time curves to 1,000 time points by either truncating longer curves or by right-padding shorter curves with zero (for flow–time curves) or the final value (for volume–time curves). We removed FEV1, FVC and PEF values in the extreme tail (top or bottom 0.5%) of the observed values and all blows that failed the acceptability criteria provided by UKB data field 3061. We used the first acceptable blow of an individual when there was more than one. In addition, we dropped all flow curves whose values do not fall in (−10, 20), all volume curves whose values are not in (−5, 10) and all flow curves in which the proportion of nonzero values is less than 20%. Finally, we generated flow–volume curves from volume–time and flow–time curves by interpolating 1,000 evenly spaced volume values between 0 and 6.58 l (the maximum observed volume in the dataset).
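
The curve construction steps can be sketched as follows. The conversion to liters, the backward finite difference and the handling of repeated volume values before interpolation are assumptions; the QC filters are omitted, and the input is a synthetic exhalation rather than UKB data.

```python
import numpy as np

N_POINTS = 1000
DT_S = 0.01   # 10-ms sampling interval

def pad_or_truncate(curve, fill):
    """Force a curve to N_POINTS by truncating or right-padding with `fill`."""
    out = np.full(N_POINTS, fill, dtype=float)
    n = min(len(curve), N_POINTS)
    out[:n] = curve[:n]
    return out

def preprocess(volume_ml):
    """Return volume-time (L) and flow-time (L/s) curves of length N_POINTS."""
    volume_l = np.asarray(volume_ml, dtype=float) / 1000.0   # assumed unit conversion
    flow = np.diff(volume_l, prepend=volume_l[0]) / DT_S     # finite difference
    return pad_or_truncate(volume_l, volume_l[-1]), pad_or_truncate(flow, 0.0)

def flow_volume(volume_t, flow_t, max_volume=6.58):
    """Interpolate flow onto 1,000 evenly spaced volumes in [0, max_volume]."""
    grid = np.linspace(0.0, max_volume, N_POINTS)
    # np.interp needs increasing x values: enforce monotonicity, drop repeats.
    vol_mono = np.maximum.accumulate(volume_t)
    vol_unique, idx = np.unique(vol_mono, return_index=True)
    # Beyond the final exhaled volume the flow is set to zero (an assumption).
    return grid, np.interp(grid, vol_unique, flow_t[idx], right=0.0)

t = np.arange(600) * DT_S                       # a synthetic 6-second blow
volume_ml = 4000.0 * (1.0 - np.exp(-t / 0.5))
volume_t, flow_t = preprocess(volume_ml)
grid, fv = flow_volume(volume_t, flow_t)
```

Padding the volume curve with its final value and the flow curve with zero keeps the two channels physically consistent: once exhalation stops, volume stays constant and flow is zero.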

We then subdivide all European GIA individuals processed this way into an 80% training set and a 20% validation set similar to ref. 6. After additionally removing related individuals, there are 259,692 individuals in the training set and 65,266 individuals in the validation set (Extended Data Fig. 1).

Asthma and COPD statuses were determined by medical records using self-report, International Classification of Diseases (ICD)-9 and ICD-10 codes as defined in ref. 6.

UKB data preparation for PPGs

PPGs from UKB were sourced from the data field 4205, which contains the arterial stiffness pressure curve. Each waveform is a single pulse sampled at 100 points. For each waveform, we computed the minimum, maximum, mean and median values. We kept a PPG only when all four statistics fell between the 0.1 and 99.9 percentiles of the corresponding statistic across all PPGs. We then subdivided all European GIA individuals processed this way into an 80% training set and a 20% validation set. After additionally removing related individuals, there are 112,730 individuals in the training set and 28,545 individuals in the validation set (Extended Data Fig. 4).
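The outlier filter on the four summary statistics can be sketched as follows. This is an illustrative NumPy example under the assumption that the waveforms are stored as an array of shape (individuals, 100); the function name and argument names are hypothetical.

```python
import numpy as np


def ppg_keep_mask(ppg, lower=0.1, upper=99.9):
    """Boolean mask of waveforms whose four statistics are all in range."""
    ppg = np.asarray(ppg, dtype=float)
    # One row per waveform: minimum, maximum, mean and median.
    stats = np.stack(
        [ppg.min(axis=1), ppg.max(axis=1), ppg.mean(axis=1),
         np.median(ppg, axis=1)],
        axis=1,
    )
    # Percentile bounds are computed per statistic across all waveforms.
    low = np.percentile(stats, lower, axis=0)
    high = np.percentile(stats, upper, axis=0)
    return np.all((stats >= low) & (stats <= high), axis=1)
```

A waveform is retained only if none of its four statistics lies in either extreme tail.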

HTN status was determined by medical records using self-report, ICD-9 codes (401.* and 405.*) and ICD-10 codes (I10 and I15.*). SBP was determined by automated reading, and data field 4080 was used in UKB.

Convolutional VAE model architecture and training

To generate SPINCs, we encode the flow–time and volume–time curves. In our VAE, we use one-dimensional (1D) convolutional layers to use the temporal context of this time series, encoding the two curves in two channels. In the encoder, we first apply three 1D convolutional layers, each followed by max pooling. We use three fully connected layers to generate the mean and variance of the bottleneck layer. We use five latent dimensions, identical to the number of EDFs, and each latent coordinate is sampled from the Gaussian distribution with the learned means and variances. The decoder architecture is a mirror image of the encoder. We start with three fully connected layers followed by transpose convolution layers, each prepended by an upsampling layer (see Extended Data Fig. 10 and SPINCs model architecture in Supplementary Note for full details).

For RSPINCs, we encode the flow–volume curve alone, and we apply the same sequences of convolutional and fully connected layers as we did for SPINCs, while using only two latent dimensions in this case. We chose to use two latent dimensions for the encoder based on REGLE’s strong reconstruction performance (Extended Data Fig. 2) while maintaining a comparable number of total latent dimensions to SPINCs. Notably, we use a modified VAE architecture to concatenate the five EDFs directly to the sampled output of the bottleneck layer (the layer right before the decoder) to learn only the residual signals not represented by the EDFs (Fig. 2a). As a result, the encoder output dimension is 2, while the decoder input has dimension 5 + 2 = 7 (see Extended Data Fig. 10 and RSPINCs model architecture in Supplementary Note for full details).

For PLENCs, we encode the PPG curves. In our VAE, we use 1D convolutional layers to use the temporal context of this time series. In the encoder, similar to SPINCs, we first apply three 1D convolutional layers, each followed by max pooling, and use three fully connected layers to generate the mean and variance of the bottleneck layer. We use five latent dimensions, identical to the number of EDFs, and each latent coordinate is sampled from the Gaussian distribution with the learned means and variances. The decoder architecture is a mirror image of the encoder. We start with three fully connected layers followed by transpose convolution layers, each prepended by an upsampling layer. RPLENCs are generated similarly to RSPINCs where we inject five EDFs directly into the sampled output of the bottleneck layer (see Extended Data Fig. 10 and PLENCs model architecture and RPLENCs model architecture in Supplementary Note for full details).

All models are trained using the standard VAE loss function consisting of the reconstruction loss and the (rescaled) Kullback–Leibler (KL) divergence loss. For RSPINCs, the KL divergence loss is only applied to the learned encodings, not to the injected EDFs. For optimization, the Adam optimizer55 is used with varying learning rates and batch sizes. No learning rate scheduler was used. After training for 100 epochs, the final learning rate and batch size values (hyperparameters) for (R)SPINCs and PLENCs were chosen to minimize the VAE loss in the validation set (Supplementary Note and Supplementary Table 1).
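To make the loss and the EDF injection concrete, the following NumPy sketch shows one possible form. It assumes a squared-error reconstruction loss, a log-variance parameterization and a scalar `kl_weight` for the rescaling; none of these choices is specified above, and all function names are hypothetical.

```python
import numpy as np


def sample_latent(mean, log_var, rng):
    """Reparameterized Gaussian sample: mean + std * standard normal noise."""
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mean, log_var):
    """KL divergence of N(mean, exp(log_var)) from N(0, I), per individual."""
    return -0.5 * np.sum(1.0 + log_var - mean**2 - np.exp(log_var), axis=1)


def decoder_input(z, edfs):
    """Concatenate sampled encodings with injected EDFs (for example 2 + 5)."""
    return np.concatenate([z, edfs], axis=1)


def residual_vae_loss(x, x_hat, mean, log_var, kl_weight):
    """Reconstruction loss plus rescaled KL on the learned encodings only."""
    recon = np.sum((x - x_hat) ** 2, axis=1)  # assumed squared error
    # The injected EDFs never enter the KL term.
    kl = kl_to_standard_normal(mean, log_var)
    return float(np.mean(recon + kl_weight * kl))
```

Because the KL term only sees `mean` and `log_var` from the encoder, the injected EDFs are passed to the decoder unregularized, which matches the description of the residual models.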

After training SPINCs, RSPINCs and PLENCs models, we use the encoders of the trained models to generate the encodings for each individual, using the mean value of the learned Gaussian distribution of the encodings. The learned variances are not used when generating the encodings.

All models were implemented in TensorFlow V2 (ref. 56).

Principal components (PCs) and cubic spline coefficients

As baseline methods for dimensionality reduction, we performed PCA and cubic spline fitting on spirograms. For PCA we concatenated volume–time and flow–time curves and used both as inputs, while for cubic spline fitting, we used only volume–time curves as cubic splines perform better for ‘smoother’ curves. To match the number of EDFs and the dimension of SPINCs, we generated five PCs and five cubic spline coefficients. We used one knot and cubic curves for spline fitting to generate exactly five coefficients, where the knot position was chosen at the 20% position to better capture the complexity at the beginning of the volume–time curves. We used scikit-learn (v1.0.2) for PCA and SciPy (v1.9.3) for spline fitting.
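The two baselines can be sketched with NumPy alone. This is an assumption-laden illustration, not the scikit-learn and SciPy code used in the study: the spline is fitted in a truncated power basis, so its five coefficients are not numerically identical to the B-spline coefficients a SciPy routine would return, although both parameterize a cubic spline with one knot. Function names are hypothetical.

```python
import numpy as np


def cubic_spline_coefficients(y, knot_fraction=0.2):
    """Least-squares cubic spline with one knot: exactly five coefficients."""
    y = np.asarray(y, dtype=float)
    x = np.linspace(0.0, 1.0, y.size)  # time rescaled to [0, 1]
    basis = np.column_stack([
        np.ones_like(x), x, x**2, x**3,
        np.clip(x - knot_fraction, 0.0, None) ** 3,  # truncated cubic term
    ])
    coef, *_ = np.linalg.lstsq(basis, y, rcond=None)
    return coef


def top_principal_components(X, n_components=5):
    """PC scores of the rows of X, computed by SVD of the centered data."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(centered, full_matrices=False)
    return U[:, :n_components] * S[:n_components]
```

For the PCA baseline, each row of `X` would hold the concatenated volume–time and flow–time curves of one individual.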

Phenotypic correlation analysis

To residualize EDFs and/or covariates from (R)SPINCs and PLENCs, we used ordinary least squares linear regression. To compute the correlation of the EDFs-and-covariates-residualized (R)SPINCs and PLENCs with the tabular fields in UKB, we first preprocessed the tabular fields to remove special codes, normalize, impute and aggregate the values and then finally transformed the categorical fields into one-hot encodings. For each correlation analysis between a feature and one of the (R)SPINCs and PLENCs, we computed the Pearson correlation and the P value with a two-sided alternative hypothesis.
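A minimal sketch of the residualization and correlation steps is given below, assuming the encodings and covariates are already numeric NumPy arrays. The function names are hypothetical, and the two-sided P value is omitted because it requires a t distribution that plain NumPy does not provide.

```python
import numpy as np


def residualize(target, covariates):
    """Residuals of target after OLS regression on covariates + intercept."""
    target = np.asarray(target, dtype=float)
    design = np.column_stack([np.ones(len(target)), covariates])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ beta


def pearson_r(a, b):
    """Pearson correlation coefficient between two 1D arrays."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))
```

By construction, the residuals are uncorrelated with every column of `covariates`, so any remaining correlation with a tabular field is not explained linearly by the EDFs or covariates.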

Survival analysis

We performed an analysis of overall survival for European GIA individuals in the spirometry (n = 65,266) and PPG (n = 28,545) validation sets using the time from first assessment (field 53) to death from any cause (field 40000). Participants who were not known to have died were right-censored at the date of UKB data ingestion (18 December 2020). We quantified the association between overall survival and each SPINC, RSPINC, PLENC, RPLENC and EDF per s.d. using the hazard ratio, which was estimated from a Cox proportional hazards model adjusting for age and sex. The proportional hazards assumption, with respect to each SPINC, RSPINC, PLENC, RPLENC and EDF, was assessed using the Schoenfeld residual test. After stratifying patients into quartiles using each SPINC, RSPINC, PLENC, RPLENC or EDF coordinate, the overall survival curves were constructed using the standard Kaplan–Meier estimator with bootstrapped 95% CIs.
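The quartile stratification and the Kaplan–Meier estimator can be illustrated with a short NumPy sketch. It covers only the point estimate; the bootstrapped CIs and the Cox model are not shown, and the function names are hypothetical.

```python
import numpy as np


def quartile_labels(values):
    """Assign each value to a quartile, labeled 0 (lowest) to 3 (highest)."""
    cuts = np.quantile(values, [0.25, 0.5, 0.75])
    return np.digitize(values, cuts)


def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate at each distinct event time.

    time:  follow-up time per participant
    event: True if the participant died, False if right-censored
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    survival = np.empty(event_times.size)
    s = 1.0
    for i, t in enumerate(event_times):
        at_risk = np.sum(time >= t)          # still under observation at t
        deaths = np.sum((time == t) & event)
        s *= 1.0 - deaths / at_risk
        survival[i] = s
    return event_times, survival
```

One curve per quartile is obtained by calling `kaplan_meier` on the participants sharing each label returned by `quartile_labels`.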

GWAS and PRSs

GWAS on all spirogram EDFs, SPINCs and RSPINCs were performed using BOLT-LMM (v2.3.6)35,36, adjusting for age, sex, age², age × sex, height, height², body mass index, smoking status, the number of packs of cigarettes smoked per year, the type of genotyping array and the top 15 genetic PCs as covariates. GWAS on all PPG EDFs, PLENCs and RPLENCs were performed using BOLT-LMM35,36, adjusting for the same covariates as SPINCs while excluding smoking status and the number of packs of cigarettes smoked per year.

All GWAS were restricted to European GIA individuals to minimize confounding. For QC we kept variants with minor allele frequency ≥0.001, imputation INFO score ≥0.8, missing call fraction ≤0.05 and Hardy–Weinberg equilibrium P value ≥ 10⁻¹⁰, among all genotyped and imputed variants provided by UKB. After GWAS, we performed Stratified Linkage Disequilibrium Score Regression57 to estimate SNP heritability and detect potential confounding. GWS ‘hits’ were defined as the most significant variants with P ≤ 5 × 10⁻⁸ and independent at r² < 0.1 using the PLINK --clump command. A reference panel for linkage disequilibrium (LD) calculation contained 10,000 unrelated European GIA samples from the UKB. Significant ‘loci’ were created based on the span of reference panel SNPs in LD (r² ≥ 0.1) with the hits. Loci separated by fewer than 250 kb were subsequently merged.
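The final merging step can be sketched in plain Python. The tuple layout (chromosome, start, end) and the function name are assumptions made for this example; the 250-kb threshold is the one stated above.

```python
def merge_loci(loci, max_gap=250_000):
    """Merge loci on the same chromosome separated by fewer than max_gap bp.

    loci: iterable of (chromosome, start, end) tuples with start <= end.
    """
    merged = []
    for chrom, start, end in sorted(loci):
        same_chrom = bool(merged) and merged[-1][0] == chrom
        if same_chrom and start - merged[-1][2] < max_gap:
            prev = merged[-1]
            # Extend the previous locus to cover the current one.
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged
```

For example, two loci on the same chromosome ending at 200 bp and starting at 200,000 bp are merged, whereas a third locus starting 300 kb after the merged span stays separate.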

While performing GWAS, PRSs for all traits were computed using the --predBetasFile option of BOLT-LMM. While GWAS was performed on individuals with valid spirometry measurements, we evaluated the performance of the PRS in a separate set of individuals not used for GWAS. More specifically, we compute the PRS of the ith individual for the kth latent embedding, \({s}_{\mathrm{ik}}\), as follows:

$${s}_{\mathrm{ik}}=\mathop{\sum}\limits_{j=1}^{M}{g}_{\mathrm{ij}}{\hat{\beta }}_{\mathrm{jk}},$$

where \({g}_{\mathrm{ij}}\) is the genotype of the ith individual at the jth variant and M is the total number of variants or SNPs. \({\hat{\beta }}_{\mathrm{jk}}\) is the effect size estimated by BOLT-LMM for the jth variant and kth latent dimension. Next, we estimate the phenotype of interest for the ith individual as follows:

$${y}_{i}=\mathop{\sum}\limits_{k=1}^{T}{s}_{\mathrm{ik}}{w}_{k},$$

where \({w}_{k}\) is the linear effect size of the kth latent embedding, estimated via an in-house linear model, and T is the number of latent embeddings. In all cases where we have five latent embeddings, we set T to 5.
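The two-stage PRS construction amounts to two matrix products, as in the following NumPy sketch. The in-house linear model is not described here, so ordinary least squares without an intercept is used purely as an assumption; the function names are hypothetical.

```python
import numpy as np


def latent_prs(genotypes, betas):
    """Per-embedding PRS: sum over variants of genotype times effect size.

    genotypes: array of shape (N individuals, M variants)
    betas:     array of shape (M variants, T latent embeddings)
    """
    return genotypes @ betas


def fit_combination_weights(scores, phenotype):
    """Assumed stand-in for the in-house linear model: OLS, no intercept."""
    weights, *_ = np.linalg.lstsq(scores, phenotype, rcond=None)
    return weights


def predict_phenotype(scores, weights):
    """Predicted phenotype: weighted sum of the T per-embedding PRSs."""
    return scores @ weights
```

Only the T combination weights are fitted on labeled individuals, which is why a few hundred labels can suffice for a disease-specific PRS.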

To determine the known lung function loci from previous literature, we extracted all significant loci from ref. 19 by downloading the full GWAS summary statistics and merging hits using the exact same criteria and P value threshold described above, and searched for lung function-related keywords in the NHGRI-EBI GWAS Catalog (v1.0.2-associations_e106_r2022-07-09)39. We used the following keywords (case insensitive) for the catalog search: ‘asthma’, ‘chronic obstructive pulmonary disease’, ‘copd’, ‘expiratory flow’, ‘fev1’, ‘forced expiratory’, ‘forced vital capacity’ and ‘lung function’. To determine the known cardiovascular function loci from previous literature, we used the following keywords (case insensitive) for the NHGRI-EBI GWAS Catalog search: ‘arrhythmia’, ‘afib’, ‘atrial fibrillation’, ‘coronary artery disease’, ‘stroke’, ‘heart attack’, ‘myocardial infarction’, ‘heart failure’, ‘mace’ and ‘rheumatic heart disease’.

Statistical power via expected chi-square statistics

We used expected chi-square statistics (E(χ²)) for all variants or known GWAS Catalog variants related to lung or cardiovascular traits as a measure of statistical power35,36. We computed the chi-square statistics for a given variant for a set of phenotypes with extremely low correlation (for example, methods such as PCA and REGLE) by summing the χ² for all phenotypes while incorporating the degrees of freedom equal to the number of phenotypes (for example, degrees of freedom of five for SPINCs and PLENCs, five for PCA with five PCs and four when we have used four PCs). Then, we computed the expected chi-square statistics (E(χ²)) for all or a subset of variants (for example, variants associated with lung function in the GWAS Catalog).
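The combination across near-uncorrelated phenotypes can be sketched as follows, assuming the per-phenotype statistics are arranged in an array of shape (variants, phenotypes). The function names are hypothetical, and converting the combined statistic into a P value (using a chi-square distribution with the returned degrees of freedom) is left out.

```python
import numpy as np


def combined_chi2(chi2_per_phenotype):
    """Sum per-phenotype chi-square statistics for each variant.

    Returns the combined statistic per variant and its degrees of freedom,
    which equal the number of phenotypes (for example five for SPINCs).
    """
    chi2 = np.asarray(chi2_per_phenotype, dtype=float)
    return chi2.sum(axis=1), chi2.shape[1]


def expected_chi2(combined, variant_mask=None):
    """Mean combined statistic over all variants or a selected subset."""
    combined = np.asarray(combined, dtype=float)
    if variant_mask is not None:
        combined = combined[np.asarray(variant_mask, dtype=bool)]
    return float(combined.mean())
```

The sum is only approximately chi-square distributed with that many degrees of freedom when the phenotypes are close to uncorrelated, which is why the approach is restricted to such sets.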

Respiratory diseases and cardiovascular traits on multiple datasets

COPDGene dataset

COPDGene is a study of 10,300 current and former smokers with and without COPD, self-reported non-Hispanic white and African Americans, without known lung diseases other than COPD and asthma (dbGaP accession: phs000179.v6.p2). Additional study details, the study protocol and details of genotyping have been previously published41,58, and additionally detailed at copdgene.org. We used the provided variant calls in VCF files and imputed the variants to the Haplotype Reference Consortium (HRC) reference panel using Michigan Imputation Server59, resulting in 39,127,678 total variants. COPD cases were determined using the Global Initiative for Chronic Obstructive Lung Disease (GOLD) criteria, where GOLD stage 2 or higher was considered as cases. Among 6,576 non-Hispanic white individuals, we had access to 1,131 (17%) asthma cases and 2,781 (42%) COPD cases, and the rest of the individuals were used as controls. Meanwhile, among 3,140 African American individuals, 760 (24%) were asthma cases and 802 (26%) were COPD cases. We used the blood pressure measurements and the ‘high blood pressure’ variable included in the dataset to define SBP and HTN traits.

EPIC-Norfolk dataset

The EPIC-Norfolk is a general population-based cohort study of men and women aged 40–79 years living in Norfolk, UK and recruited from general practices between 1993 and 1997. EPIC-Norfolk cohort participants were linked annually to nationally held hospital records and death certificates from 1999 to 2019 using UK National Health Service numbers. COPD was defined as any hospital admission or cause of death coded 490–492, 494–496 (ICD-9) or J40–J44, J47 (ICD-10). Asthma was similarly defined using codes 493 (ICD-9) or J45, J46 (ICD-10). HTN was defined using hospital records and death certificates for ICD codes 401.*, 405.* (ICD-9) and I10, I15.* (ICD-10). The SBP was determined using the continuous SBP from the EPIC-Norfolk health examination at baseline, which is the time point with the highest number of individuals. In a small set of participants who do not have a baseline blood pressure measurement, we used blood pressure measured at the earliest subsequent health examination. Blood pressure was measured at two time points during the examination, with the mean used for analysis.

eMERGE III dataset

We use the following five consent groups that do not require IRB approval: General Research Use (GRU), Health/Medical/Biomedical-Genetic Studies (HMB-GSO), Health/Medical/Biomedical (HMB), Health/Medical/Biomedical (MDS) (HMB-MDS) and Health/Medical/Biomedical (PUB, GSO) (HMB-PUB-GSO; dbGaP accession: phs001584.v2.p2). We have access to 1,038 asthma cases and 7,250 controls for European GIA, while in the case of African GIA, we have access to 649 asthma cases and 1,367 controls. We used the 39,131,578 variants that are imputed to the HRC reference provided by dbGaP60. Asthma and SBP traits were defined using the corresponding variables in the dataset. HTN was defined using two variables, ‘CASE_CONTROL_CKD_T2D_HTN’ and ‘CASE_CONTROL_CKD_T2D’, where the individuals in the former group but not in the latter group were defined as the HTN cases. Note that this limited our analysis to hypertensive individuals without chronic kidney disease or type 2 diabetes.

Indiana Biobank dataset

The Indiana Biobank is a state-wide collaboration that provides centralized processing and storage of specimens that are linked to participants’ electronic medical information via the Regenstrief Institute at Indiana University. COPD was diagnosed by using ICD-9 (491, 492 and 496) and ICD-10 (J41, J42, J43 and J44). Asthma was diagnosed by using ICD-9 (493) and ICD-10 (J45 and J46). Cases were defined as having at least one in-patient diagnosis or two out-patient diagnoses. Those participants who did not have any diagnoses were defined as controls. Thus, we have 1,445 COPD cases and 3,808 controls, while we have 1,171 asthma cases and 4,083 controls. Among 5,253 individuals for COPD evaluation, 3,797 were of European GIA, 1,371 were of African GIA and 85 were of Hispanic ancestry. Among 5,254 individuals for asthma evaluation, 3,805 were of European GIA, 1,362 were of African GIA and 87 were of Hispanic ancestry. Indiana Biobank samples used in this study were genotyped using the Illumina Infinium Global Screening Array by Regeneron. SNPs with missing rate >5%, minor allele frequency ≤1% and Hardy–Weinberg equilibrium P value <1 × 10⁻¹⁰ among cases and <1 × 10⁻⁶ in controls were excluded as reported previously43. Genotyping data were imputed to 1000 Genomes using the Michigan Imputation Server59. Imputed variants with r² < 0.30 and minor allele frequency < 1% were excluded. PLINK61,62 was used to calculate PRS using imputation dosages.
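The case and control definition can be written as a small decision rule. In this sketch the function name and the count-based inputs are hypothetical, and treating participants with exactly one out-patient diagnosis as neither case nor control is an assumption that follows from the two definitions above rather than something stated explicitly.

```python
def classify_participant(n_inpatient, n_outpatient):
    """Return 'case', 'control' or None for one disease.

    n_inpatient / n_outpatient: number of qualifying in-patient and
    out-patient diagnoses (matching ICD-9 or ICD-10 codes).
    """
    if n_inpatient >= 1 or n_outpatient >= 2:
        return "case"
    if n_inpatient == 0 and n_outpatient == 0:
        return "control"
    # Exactly one out-patient diagnosis: assumed excluded from the analysis.
    return None
```

Applying the rule separately for COPD and asthma codes yields the per-disease case and control sets.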

Functional significance of discovered loci

We ran GREAT (v4.0.4)38 on the human GRCh37 assembly to perform functional enrichment analysis of SPINCs, RSPINCs, PLENCs, RPLENCs and EDF loci. We used the default ‘basal + extension’ region–gene association rule with 5 kb upstream, 1 kb downstream, 1,000 kb extension and curated regulatory domains included. Furthermore, we ran GARFIELD (v2)37 with default parameters to perform tissue-specific analysis where we used 424 DNase I hypersensitive site hotspot annotations provided by the GARFIELD authors37.

Genetic phenome-wide association study

To compute PheWAS, we downloaded GWAS summary statistics for 7,221 phenotypes from the Pan-UKB consortium (20200615 release; https://pan.ukbb.broadinstitute.org). After restricting to phenotypes that contained European GIA statistics and did not persistently fail in LD clumping, we were left with 7,145 pruning + thresholding (P + T) PRSs generated by PLINK (https://www.cog-genomics.org/plink1.9) using the --clump command with an index variant significance threshold of 5 × 10⁻⁸ and LD threshold of 0.1, with LD computed from a random subset of 10,000 European GIA individuals in UKB.

SPINCs, RSPINCs and PLENCs P + T PRSs were computed analogously to the Pan-UKB PRSs. We computed the Pearson correlations between the PRSs derived from latent dimensions and the PRSs derived from Pan-UKB phenotypes and the P values with a two-sided alternative hypothesis.

Statistics and reproducibility

All code necessary to reproduce the results in this work is available on our GitHub repository. No statistical method was used to predetermine the sample size. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. We removed samples with no valid blows from our spirogram analyses. To QC the blows, we dropped any blow for which one of the FEV1, FVC and PEF values was in the extreme tail (top or bottom 0.5%). We dropped all flow curves whose values do not fall in (−10, 20), all volume curves whose values are not in (−5, 10) and all flow curves in which the proportion of nonzero values is less than 20%, assuming these blows are likely noisy. We also removed blows that failed the acceptability (validity) criteria provided by UKB. We treated a blow as valid if the value for the UKB field 3061 is 0 or 32. For PPG analysis, we removed outliers defined by any of the four statistics (that is, minimum, maximum, mean and median) falling in the extreme tails of the distribution (top or bottom 0.1%).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.