Plasma proteomic signatures of a direct measure of insulin sensitivity in two population cohorts

Aims/hypothesis: The euglycaemic–hyperinsulinaemic clamp (EIC) is the reference standard for the measurement of whole-body insulin sensitivity but is laborious and expensive to perform. We aimed to assess the incremental value of high-throughput plasma proteomic profiling in developing signatures correlating with the M value derived from the EIC.
Methods: We measured 828 proteins in the fasting plasma of 966 participants from the Relationship between Insulin Sensitivity and Cardiovascular disease (RISC) study and 745 participants from the Uppsala Longitudinal Study of Adult Men (ULSAM) using a high-throughput proximity extension assay. We used the least absolute shrinkage and selection operator (LASSO) approach using clinical variables and protein measures as features. Models were tested within and across cohorts. Our primary model performance metric was the proportion of the M value variance explained (R²).
Results: A standard LASSO model incorporating 53 proteins in addition to routinely available clinical variables increased the M value R² from 0.237 (95% CI 0.178, 0.303) to 0.456 (0.372, 0.536) in RISC. A similar pattern was observed in ULSAM, in which the M value R² increased from 0.443 (0.360, 0.530) to 0.632 (0.569, 0.698) with the addition of 61 proteins. Models trained in one cohort and tested in the other also demonstrated significant improvements in R² despite differences in baseline cohort characteristics and clamp methodology (RISC to ULSAM: 0.491 [0.433, 0.539] for 51 proteins; ULSAM to RISC: 0.369 [0.331, 0.416] for 67 proteins). A randomised LASSO and stability selection algorithm selected only two proteins per cohort (three unique proteins), which improved R² but to a lesser degree than in standard LASSO models: 0.352 (0.266, 0.439) in RISC and 0.495 (0.404, 0.585) in ULSAM. Reductions in improvements of R² with randomised LASSO and stability selection were less marked in cross-cohort analyses (RISC to ULSAM R² 0.444 [0.391, 0.497]; ULSAM to RISC R² 0.348 [0.300, 0.396]). Models of proteins alone were as effective as models that included both clinical variables and proteins using either standard or randomised LASSO. The single most consistently selected protein across all analyses and models was IGF-binding protein 2.
Conclusions/interpretation: A plasma proteomic signature identified using a standard LASSO approach improves the cross-sectional estimation of the M value over routine clinical variables. However, a small subset of these proteins identified using a stability selection algorithm affords much of this improvement, especially when considering cross-cohort analyses. Our approach provides opportunities to improve the identification of insulin-resistant individuals at risk of insulin resistance-related adverse health consequences.
Supplementary Information: The online version contains peer-reviewed but unedited supplementary material available at 10.1007/s00125-023-05946-z.

OGTT. We used plasma samples collected, aliquoted, and stored at the OGTT study day visit for the proteomic measurements. Clinical phenotypic data were collected at the same visit through multiple questionnaires, and a physical exam was performed to document height, weight, waist/hip/thigh circumferences, bioimpedance estimates of percent fat and fat-free mass (TANITA scale), heart rate and blood pressure. During the EIC, the target plasma glucose concentration was maintained between 4.5 and 5.5 mmol/l and insulin was infused at a rate of 240 pmol·min⁻¹·m⁻². Bedside plasma (or blood) glucose was measured at 5- to 10-min intervals to ensure it remained within 0.8 mmol/l (±15%) of the target glucose concentration. The steady-state period was from 80 to 120 min.
ULSAM: Participants followed procedures similar to those in RISC, but in the context of a return visit for their 20-year follow-up exam at approximately 70 years of age (the cohort started when all participants were aged 50). The OGTT and the clamp procedure were performed at least 1 week apart. The plasma samples used to measure OLINK proteins were from the day of the clamp procedure. The euglycemic insulin clamp in this cohort was performed with a target plasma glucose concentration of 5.1 mmol/l. The insulin infusion rate was 56 mU per minute per square meter of body surface area. Bedside plasma glucose was measured at 5-min intervals to ensure it remained within ±0.2 mmol/l of the target glucose concentration. The M-value was calculated between 60 and 120 min of infusion. In a subset of 17 men, replicate measurements were performed to estimate measurement error [1]. The coefficient of variation for the M-value was 9.3%.
We applied an inverse-normal transformation to each protein within each cohort.
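As an illustration of what an inverse-normal transformation does, the following minimal Python sketch maps one protein's values within one cohort to normal scores through their ranks. It is not the authors' code (their analyses were run in R). The function name `inverse_normal_transform` is hypothetical, and the rank offset of 3/8 (the Blom constant) is an assumption, since the text does not state which offset was used.

```python
from statistics import NormalDist


def inverse_normal_transform(values, offset=3.0 / 8.0):
    """Rank-based inverse-normal transform of one protein within one cohort.

    offset=3/8 (Blom) is an assumed choice; the source does not specify it.
    """
    n = len(values)
    # Indices of the observations, ordered from smallest to largest value.
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over any run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Tied observations share the average of their 1-based ranks.
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    std_normal = NormalDist()
    # Convert each rank to a quantile in (0, 1), then to a normal score.
    return [
        std_normal.inv_cdf((r - offset) / (n - 2.0 * offset + 1.0))
        for r in ranks
    ]
```

Applied separately to each protein within each cohort, a transform of this kind puts all proteins on a common, approximately standard normal scale regardless of their original distribution.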

Statistical Analyses
In both cohorts, we excluded subjects who failed sample quality control (QC) or were missing M-values. For ULSAM, we also excluded subjects with prevalent diabetes at the time of their age 70 clinic visit, when they underwent the euglycemic clamp. We did not exclude ULSAM participants with prevalent cardiovascular disease at the time of their clamp. We then excluded proteins with a high proportion of missing measurements in either cohort. Sample QC was evaluated on each plate: a plate median was calculated for the Incubation Control 2 and for the Detection Control. For each sample, the result for each of these internal controls was allowed to deviate by no more than ±0.3 NPX from the plate median. If either or both internal controls exceeded the 0.3 NPX limit, the sample failed QC. If more than one-sixth of the samples failed QC, the run was deemed unreliable. The reason for the failure was then evaluated and, if applicable, samples were rerun (https://olink.com/faq/how-is-quality-control-of-the-data-performed/).
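The sample QC rule described above can be sketched in a few lines of Python. This is a hedged illustration of the stated thresholds only, not the vendor's software: the function name `plate_qc` and the input layout (two lists of NPX values, one entry per sample on a plate) are assumptions.

```python
from statistics import median

NPX_LIMIT = 0.3            # maximum allowed deviation from the plate median
MAX_FAIL_FRACTION = 1 / 6  # a run is unreliable above this share of failures


def plate_qc(incubation_ctrl, detection_ctrl):
    """Return (per-sample pass flags, run_is_reliable) for one plate.

    Both inputs are hypothetical lists of NPX values for the two internal
    controls, aligned by sample.
    """
    inc_median = median(incubation_ctrl)
    det_median = median(detection_ctrl)
    # A sample passes only if both controls stay within the NPX limit.
    passed = [
        abs(inc - inc_median) <= NPX_LIMIT
        and abs(det - det_median) <= NPX_LIMIT
        for inc, det in zip(incubation_ctrl, detection_ctrl)
    ]
    fail_fraction = passed.count(False) / len(passed)
    return passed, fail_fraction <= MAX_FAIL_FRACTION
```

Samples flagged as failing would be excluded from analysis, and a plate on which the run is flagged as unreliable would be a candidate for investigation and rerun.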
We first performed standard linear regression in the RISC and ULSAM cohorts independently to calculate marginal associations of the M-value with each of the OLINK proteins measured. All association analyses included covariates age, sex, and recruitment center. Then, we ran a second model additionally including body mass index (BMI). The Benjamini-Hochberg False Discovery Rate (FDR) method and a Bonferroni-corrected alpha threshold (adjusting for the final 823 proteins analyzed) were used to identify significant associations.
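The two multiple-testing corrections named above can be illustrated with a short Python sketch. It is not the authors' code. The function names are hypothetical, and the family-wise alpha of 0.05 and the FDR level `q=0.05` are assumptions; the text states only that 823 proteins were analyzed.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the test is rejected at FDR q.

    Implements the Benjamini-Hochberg step-up rule: find the largest rank k
    with p_(k) <= (k / m) * q and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank  # keep the largest rank that qualifies
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Assumed family-wise alpha of 0.05 over the 823 proteins analyzed.
threshold = bonferroni_alpha(0.05, 823)
```

With these assumed inputs the Bonferroni threshold is approximately 6.1 × 10⁻⁵ per protein. The Benjamini-Hochberg rule is less conservative because it controls the expected proportion of false discoveries rather than the chance of any false positive.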
To choose the LASSO regularization parameter lambda (λ), we used the cv.glmnet function in glmnet to perform 10-fold cross-validation. Hyperparameter tuning was done with glmnet at alpha = 1 (LASSO penalty), applying the one-standard-error rule: we used the value of λ that gave the most regularized model whose cross-validation error was within one standard error of the minimum cross-validation error (lambda = lambda.1se). In the RISC and ULSAM cohorts separately, models were trained on a randomly selected 70% of the cohort and tested on the remaining 30%. The training/testing split was made with the createDataPartition function in R, using the M value as the reference outcome vector (y) so that the training and test sets both retained the same proportion of cases and controls (1:3), avoiding imbalances in case numbers.
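The one-standard-error rule can be illustrated with a small Python sketch. It mirrors the definition given in the text rather than reproducing cv.glmnet itself; the function name `lambda_one_se` and its three list inputs (candidate lambdas, mean cross-validation error, and its standard error for each lambda) are hypothetical.

```python
def lambda_one_se(lambdas, cv_mean, cv_se):
    """Pick the largest lambda whose mean CV error lies within one standard
    error of the minimum mean CV error.

    A larger lambda means a stronger penalty, hence a sparser model.
    """
    # Index of the lambda with the lowest mean cross-validation error.
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    # Upper bound: minimum error plus its own standard error.
    threshold = cv_mean[best] + cv_se[best]
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= threshold]
    return max(eligible)
```

Choosing the most regularized model within one standard error of the minimum trades a small, statistically indistinguishable loss in cross-validated error for fewer selected proteins.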
Our final analysis included a stability selection algorithm to improve the selection of proteins and to obtain error control for the number of falsely selected noise variables. This algorithm extends the standard LASSO approach by performing a LASSO regression multiple times on subsamples of the training data and returning a selection probability for each predictor (the number of times it was selected divided by the number of regressions performed). This type of regularization is advantageous when the number of predictors exceeds the number of observations: it selects variables more consistently, offers better error control and does not depend strongly on the penalization parameter.
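The selection probability described above (times selected divided by number of regressions) can be sketched generically in Python. This is not the randomized LASSO itself: `select` is a hypothetical callable standing in for one LASSO fit, and the defaults of 100 subsamples, each containing half of the training rows, are assumptions that the text does not specify.

```python
import random


def selection_probabilities(rows, predictors, select,
                            n_subsamples=100, fraction=0.5, seed=0):
    """Estimate how often each predictor is selected across subsamples.

    rows       -- list of training observations (any representation)
    predictors -- names of the candidate predictors
    select     -- hypothetical callable: takes a list of rows and returns
                  the set of predictor names chosen by one model fit
    """
    rng = random.Random(seed)
    counts = {name: 0 for name in predictors}
    size = max(1, int(len(rows) * fraction))
    for _ in range(n_subsamples):
        # Draw a subsample without replacement and refit the selector.
        subsample = rng.sample(rows, size)
        for name in select(subsample):
            counts[name] += 1
    # Selection probability = times selected / number of fits.
    return {name: counts[name] / n_subsamples for name in predictors}
```

Predictors whose selection probability exceeds a chosen cutoff would be retained as stable; the cutoff used in the study is not given in this section.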
LASSO analyses were conducted in R, version 3.3.0, using the glmnet and caret (confusionMatrix function) packages and the randLassoStabSel package to perform the randomized LASSO stability selection. The latter uses the 'stabsel' function from the 'stabs' package but implements the randomized LASSO version. We followed the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement.

Cohort Characteristics
Plasma protein profiling was attempted in 1037 subjects of the RISC study, and all but 64 passed sample QC. We then further excluded 7 subjects with missing M-values, leaving 966 subjects available for analysis. For ULSAM, profiling was attempted in 954 subjects, and all but 48 passed sample QC. We then further excluded 54 subjects with missing M-values and 107 subjects with prevalent diabetes, leaving 745 subjects available for analysis.
We excluded five of the 828 proteins due to a high proportion (>25%) of missing values in either cohort. Among the remaining proteins, no imputation of values was necessary as no subjects had missing protein levels. Furthermore, no imputation of clinical variables was necessary as only a very small number of subjects with full protein profiles were missing one or more covariates (n = 7 in RISC).
Of note, the age ranges of the two cohorts did not overlap: mean age was 44.4 years (SD, 8.3 years) in RISC at baseline and 70.9 years (SD, 0.6 years) in ULSAM. In RISC, 55% of participants were women, whereas in ULSAM all participants were men.

Standard linear regression analyses and replication of marginal effects of proteins
We identified 141 and 136 proteins in RISC and ULSAM, respectively, passing the Bonferroni-corrected significance threshold (alpha = 6.1 × 10⁻⁵, considering 823 proteins tested) in the first model adjusted for age, sex, and center. In the second model, additionally adjusted for BMI, these numbers decreased to 69 and 72. Among significant proteins in RISC, 69/141 (48.9%) and 14/69 (20.3%) replicated in ULSAM for the first and second models, respectively. Among significant proteins in ULSAM, 69/136 (50.7%) and 14/72 (19.4%) replicated in RISC for the first and second models, respectively.
Resistance (HOMA-IR) index in the LASSO regression models.
Table 5. Full cross-tabulations of observed and predicted M-values by class (lowest quartile = case, rest = non-case) and additional diagnostic test proportions.
Table 6. Variance explained (R²) using M as a continuous variable and area under the ROC curve (AUC) statistic using M as a binary variable, excluding proteins with values below the lower limit of detection (LOD) at progressively more stringent cutoffs for removal of proteins.
Table 7. Proteins selected by LASSO and their LASSO-derived multivariable effect by linear (for continuous M-value) or logistic (for binary M-value) regression for all models tested in the Relationship between Insulin Sensitivity and Cardiovascular disease (RISC) and Uppsala Longitudinal Study of Adult Men (ULSAM) cohorts. A positive effect reflects the mean increase in M value per standard deviation (SD) increase in the level of the protein measured in the plasma, consistent with improved insulin sensitivity. A negative effect is consistent with a mean decrease in M value and insulin sensitivity.
Table 8. Number of times a protein was selected among the named set of models.
Table 9. Proportion of pairwise correlations of proteins reaching specified thresholds of r (correlation)