Background

Recently, genetic studies involving a polygenic risk score (PRS) have dramatically grown, and sophisticated tools and methodologies are being developed for its use [1,2,3]. Along with heritability, PRS has become an important metric for explaining complex genetic diseases (e.g. Alzheimer’s disease, AD) and traits [4,5,6,7,8]. A typical PRS study involves both the discovery and test phases [9, 10]. In the discovery phase, two different methods are used to develop PRS. PRS is essentially derived from the raw genetic data, denoted as rPRS. Instead, the summary statistics from large-scale genetics studies or GWAS catalogs are also used, which we abbreviate as sPRS. Polygenic prediction performance is then evaluated by the marginal contribution of the PRS term in a regression model on target clinical application [10, 11].

An underlying assumption of PRS models is that the subjects from the discovery set do not overlap with those of the test set [10], which are well-known basic prerequisite for PRS studies. However, our preliminary analyses (Fig. 1A(i)) demonstrate a significant number of identical subjects across multiple genetic datasets. The overlapping subjects may be identified and removed for rPRS using the raw genetic data, a challenge remains for sPRS in which raw data is inaccessible. Therefore, we posit that a strict level of independence across datasets is hard to achieve with sPRS. This may pose serious issues to related fields, since the subject-level dependence across datasets may not only inflate the polygenic prediction performance, but also prevent generalizable applications of the developed model.

Fig. 1
figure 1

Overview of the study. (A) (i) Overlapping subjects are observed between AD genetic initiatives. (ii) There is no overlapping subject across ethnicities. Until now, trans-ethnic applications of PRS have been limited. We suspect that subject overlap within an ethnicity is one of the key factors to explain overestimated performances, which motivates this study. We divide PRS into two cases, where rPRS represents when the genetic information is provided and used as the discovery set and sPRS stands for the case when GWAS is pre-conducted and only summary statistics are provided. (B) For rPRS, overlapping subjects (n = 432) between ADSP and AMP-AD are identified, which breaks the independence assumption and causes the overestimation bias. For sPRS, the overlapping ratio cannot be examined by giving the summary statistics. However, the suspected inflation in the AD prediction performance (denoted by sPRS - rPRS) motivates further analysis of the scale effect of the datasets because IGAP has a larger number of samples. (C) (i) Two new variables, hypertension and height, from the UK Biobank database are introduced to compute the upper bounds of the scale effect. Hypertension and height have a higher heritability than AD. Thus, they act as the upper bounds for AD over PRS performances (shown in the QQ plot). (ii) In AD, the gap between sPRS and rPRS (area shaded in green) is attributable to either the overestimation bias or the scale effect of the sample size of the discovery set. Because UK Biobank consists of a larger number of samples (n = 342,318), the scale effect can be measured via computing the performance gains per sample unit. Cohort case counts and their percentages of the total were as follows: ADSP had 5687 (55.2%), AMP-AD had 696 (61.4%), IGAP had 17,008 (31.4%), and UK Biobank had 82,719 (24.2%)

Among prior studies, we identify multiple signs of potential inflation in PRS performance attributable to overlapping subjects. First, if the independence between discovery and test sets is clearly stated in the paper, the PRS performance of binary traits not statistically significant, while if not, results were highly variable or inflated [12,13,14,15,16]. For instance, for models in which independence is explicitly controlled during development, even with a large-scale national biobank, PRS contributed less than 2% in the model accuracy [17]. As a similar line of evidence, in a large-scale finnish study [12], polygenic predictions were not significant for datasets in which the independence is guaranteed, while in other groups without the guarantee performed significantly higher. Second, we argue that the low portability PRS in trans-ethnic applications may serve as another evidence of overlapping subjects developed for within-ethnic models. Low trans-ethnic portability, yet is not fully understood, has been attributed to different linkage disequilibrium (LD) structures, allele frequencies, and marginal effect size variations according to ancestries [18]. We suspect that the strictly preserved independence of subjects between different ethnicities (Fig. 1A(ii)) limits the prediction performance of trans-ethnic models (Fig. 1B). In other words, it is plausibly the level of dependence (i.e., overlapping subjects) between datasets that is one of the reasons for the gap between within- and trans-ethnic generalization capacities of PRS models.

The clues mentioned above, albeit circumstantial, led us to a systematic investigation for detecting and quantifying the overestimation bias in sPRS due to overlapping subjects. On AD prediction, we first prove that PRS models overfit to overlapping subjects, resulting in overestimated prediction performances on the test set. Then we extend our experiments using UK Biobank data to derive the scale effect of the inflation and brief guidelines for detecting the bias in sPRS.

Results

Overview of the study

Figure 1 illustrates an overview of our study design. We design an rPRS model that derives SNPs from ADSP, which is then replicated for all subjects from AMP-AD (see Fig. 1B for details and Fig. 2A for results). After removing subjects from AMP-AD with close kinship with the ADSP study, predictions are made again on AMP-AD, and the two results with and without close subjects are compared. We also compare rPRS and sPRS. To this end, ADSP is divided into 9:1 (discovery: test) splits for ten-fold cross-validation for rPRS (Fig. 2B), while AMP-AD data are used as another test set for rPRS (Fig. 2C). An increasing degree of overlapping bias is observed with an expanding number of subjects in the test set being replaced by samples from the corresponding discovery set (Fig. 2D). We further demonstrate that sPRS also overestimates prediction performances (Figs. 1B and 2B, and Fig. 2C). We compare the sPRS prediction results against rPRS to indirectly infer the level of overfitting. However, the number of subjects in the IGAP study is larger than that of ADSP, and PRS predictions may not be directly comparable due to the scale effect, where a larger discovery set may result in better generalization capability.

Fig. 2
figure 2

PRS performance comparisons for Alzheimer’s disease. ΔAUC and ΔR2 denote the additive gain from introducing PRS term to Model II (refer to Materials and Methods for details). For convenience, we abbreviate the discovery and test sets as D and T, respectively. (A) AD prediction performances with and without subject overlap (D: ADSP, T: AMP-AD). All metrics of overlapping subjects are overestimated, growing in an increasing number of SNPs. (B) sPRS (D: IGAP, T: ADSP) is compared to rPRS (D: ADSP, T: ADSP). (C) AMP-AD data is another T for rPRS (D: ADSP) and sPRS (D: IGAP). D and T of ADSP data are derived from tenfold cross-validation. In both (B) and (C), sPRS performances are significantly higher than rPRS, and we suspect that some participants of IGAP are identical to a subset of ADSP or AMP-AD. (D) A simulated study is conducted with rPRS (D: ADSP, T: AMP-AD), in which a subset of D replaces a growing number of subjects in T (see Results for details). The number of SNPs in the x-axis denotes number of the LD pruned SNPs selected in the order from the lowest P-value thresholds. That is, the lower number of SNP in the left side means the stricter P value threshold and the right-most side is the most generous P value threshold (P < 0.5)

To adjust for the scale effect, we leverage a large number of samples in UK Biobank in terms of two phenotypes with higher heritabilities than AD [19,20,21], namely hypertension, and height, which are binary and non-binary variables. With a varying number of subjects in the discovery set, the rate of change per subject in rPRS accuracies is inferred. Finally, we estimate the level of overestimation bias in sPRS for AD prediction (Figs. 1C and 3).

Fig. 3
figure 3

PRS performance comparisons via UK Biobank. In this study, UK Biobank’s primary purpose is to evaluate the scale effect, defined as the marginal gain of performance due to the size of the discovery set. To this end, two variables representative for high heritability, namely hypertension, and height, are analyzed. For experimental purposes, we intentionally design three discovery sets with different sizes, 300k, 60k, and 9k, which approximately correspond to the discovery set sizes of the full UK Biobank dataset, IGAP, and ADSP, respectively. For convenience, we abbreviate the discovery and test sets as D and T. ΔAUC and ΔR2 denote the additive gain from introducing the PRS term to Model II (refer to Materials and Methods for details). (A) A larger D size results in higher prediction performances (ΔAUC and ΔR2), demonstrating the scale effect as hypothesized. However, in the three sample sizes, a smaller subset of T rarely degrades ΔAUC or ΔR2, but it had an impact on the significance level P, perhaps intuitively. As the highest heritability (Fig. 1C) foretells, the height variable applied in PRS showed a greater impact on the prediction model than hypertension, as indicated by higher ΔR2 and –log(P). (B) When the number of SNPs varies with 100% of T used, most metrics show improvements until 50k SNPs are used, which plateaus. The number of SNPs in the x-axis denotes number of the LD pruned SNPs selected in the order from the lowest P-value thresholds. That is, the lower number of SNP in the left side means the stricter P value threshold and the right-most side is the most generous P value threshold (P < 0.5). (C) Although the size of D with 100% of T used shows a linear correlation with PRS performances, proving the hypothesized scale effect, the improvements are not dramatic. For instance, ΔR2 increases by approximately 0.0000125 and 0.0000083 per 3k of D

PRS prediction performance after excluding genetically related individuals

432 identical subjects overlap between ADSP and AMP-AD (Fig. 1B). Using ADSP as the discovery set, rPRS on all subjects in AMP-AD results in ΔAUC of 0.069 (P = 1.51 × 10–10). After removing the overlapping individuals from AMP-AD, the ΔAUC decreases to 0.0017. Notably, ΔAUC loses its statistical significance (P = 0.57). ΔR2 shows a similar level of deflation, which drops from 0.11 to 0.0041 by removing the identical subjects (Fig. 2A). PRS performances are only slightly affected when close relatives are removed by applying a lower cutoff of PI_HAT (Supplementary Table 2), and ΔAUC still shows no statistical significance.

rPRS and sPRS Performances on AD prediction

Figure 2B and C, and Supplementary Table 3 show the comparison results of rPRS and sPRS. ADSP data are divided into the discovery and test datasets with 9:1 cross-validation for assessment of rPRS (Fig. 2B). Another test of rPRS is evaluated on non-overlapping data of AMP-AD (n = 692) (Fig. 2C). We compute sPRS using summary statistics derived from the first stage of IGAP study (Fig. 2B and C). sPRS on ADSP is evaluated on 10 test folds (i.e., sets), while, on AMP-AD, the whole data (n = 1133) are regarded as a test set. Unlike rPRS, sPRS is evaluated on all AMP-AD data as overlapping subjects are not identifiable against IGAP.

In rPRS, a respective set (ΔAUC, ΔR2) of ADSP and AMP-AD data are (0.0071 ± 0.0052, 0.0077 ± 0.0045) and (0.0013 ± 0.00091, 0.0011 ± 0.0018). P-values are not significant (P > 0.05). In other words, when data independence is guaranteed, PRS for AD displays unexpectedly low performance. In sPRS, a respective set (ΔAUC, ΔR2) of ADSP and AMP-AD are as high as (0.051 ± 0.013, 0.063 ± 0.015) and (0.060, 0.086). The results of sPRS are significantly inflated in comparison to those of rPRS.

PRS performances are sensitively affected by dependency

Given the suspected inflation in sPRS, we wanted to show how the PRS results are sensitively affected according to the subject overlap, as well as estimate the number of overlapping subjects in sPRS. To this end, we simulate an increasing number of subjects from the discovery set to be added into the test set. All ADSP data (n = 10,293) are used to derive an rPRS model, which subsequently is evaluated on a mixture of AMP-AD (independent test set, n = 692) and ADSP data randomly selected from test splits of ten-fold cross-validation (fully dependent test set, n = 692), for which the portion of the latter increases from 0 to 100% via an increment of 10%. sPRS is derived from the first stage data of the IGAP study and evaluated in the same way (Fig. 2D). As expected, all ΔAUC, ΔR2, and –log(P) monotonically increases in a growing portion of subject overlap. sPRS performances remained relatively unchanged while maintaining the inflated values greater than those in independent rPRS (Fig. 2B and C). To show that the AD characteristics of the test sets are maintained, performances of APOE ε4 are also displayed. When only APOE ε4 status is included for developing rPRS in the same manner, performances are relatively stable compared to rPRS, which indicates that the characters of AD are maintained irrespective of the AMP-AD and ADSP combination. Judging by the intersection of two lines representing rPRS and sPRS trends, we infer at least 10% of participants in AMP-AD or ADSP are included in IGAP. However, this holds true only if the number of subjects in discovery sets is equal. Meanwhile, the number of subjects in IGAP is five times larger than that of ADSP. Therefore, to confirm the suspected inflation in sPRS, we must investigate the scale effect of the discovery set.

Upper bounds of PRS performance derived from UK Biobank data

UK Biobank data are leveraged to infer the upper bounds for the scale effect of sPRS. Hypertension and height are selected as two target variables to investigate the scale effect due to their notably high heritability (Fig. 1C). We posit that AD prediction scores are bounded by both hypertension and height thanks to superior heritability and the scale effect. Subjects from UK Biobank are split by 9:1 following the ten-fold cross-validation scheme.

To investigate the scale effect of AD, we evaluate rPRS on three different sizes of the discovery set, corresponding to 9k, 60k, 300k (i.e., full data), and roughly equal to the size of ADSP, IGAP, and UK Biobank, respectively (Supplementary Table 4). Figure 3 summarizes and visualizes the results. For hypertension, with a discovery set size of 60k subjects, a set of metrics (ΔAUC, ΔR2) is (0.0033 ± 0.00047, 0.0033 ± 0.00051). When the size is 300k, (ΔAUC, ΔR2) is (0.012 ± 0.0014, 0.012 ± 0.0017), which is still significantly smaller than the sPRS of ADSP and AMP-AD (Fig. 2B and C) corresponding to (0.051 ± 0.013, 0.063 ± 0.015) and (0.060, 0.086), respectively.

For height, ΔR2 for 60k and 300k sizes are 0.028 ± 0.0015 and 0.075 ± 0.0056 (Fig. 3A and Supplementary Table 5). Relatively larger contributions of PRS in height compared to hypertension reflect the greater heritability. In both variables, PRS scores plateau at approximately 50k of SNPs (Fig. 3B). The results of sPRS using IGAP (54k) as a discovery set are highly inflated compared to those of hypertension and height at a similar scale (60k). Scores display growth linear or sublinear to the size of the discovery set (Fig. 3C).

To obtain statistical significance for ΔAUC, both the discovery and test sets require a substantially large number of subjects (Fig. S1). For instance, for a 60k-sized discovery set, more than 10k subjects are needed in the test set for sufficient power (P < 0.01). Lassosum [2], which uses the LD information, shows two-fold higher performance than PRS (Supplementary Table 6), but the linear pattern of the scale effect remains unchanged.

The results can vary based on the case:control ratio of the discovery set. The IGAP study consisted of 17,008 cases and 37,154 controls, yielding a case:control ratio of 1:2.18, while the UK Biobank study had a case:control ratio of 1:3.14 (Table S1). Thus, to demonstrate that the lower PRS performances aren’t a result of the case:control ratio, we kept the number of cases constant at 17,008, and varied the case:control ratio to 0.33, 1, and 3 (Fig. S2). The results from a case:control ratio of 1:3 are almost comparable to those observed with 20% of the discovery set, as depicted in Fig. 3. Furthermore, we examined the results according to both the MAF of the SNPs and the number of SNPs. The outcomes did not significantly differ across the range of the SNP’s MAF interval (Fig. S2A). As for the results according to the number of SNPs, these were saturated from the outset and showed minimal variation (Fig. S2B and Fig. 3B). When the number of cases is held constant, the highest performance was observed at a case:control ratio of 1:3 (Fig. S2B). Therefore, if the number of cases in the discovery set remains fixed, the larger the total subject count, the higher the performance of the PRS.

Discussion

This study illuminates the overestimation bias in PRS studies. In AD prediction, we prove the presence of latently overlapping subjects for sPRS and demonstrate performance inflation. Developing sPRS without knowledge of the overlapping individuals raises concerns over overestimation and the model’s generalizability. We argue that overestimation needs to be suspected when the contribution of sPRS derived from discovery datasets with much smaller sample sizes than nation-wide biobank, by the referenced PRS methods [22], exceeds 1–2% of ΔR2 for binary traits [17] and 7–8% of ΔR2 for non-binary traits, which are the respective upper limits drawn by hypertension and height from UK Biobank.

As evidenced by the three separate AD studies sharing identical subjects, subject overlap may be prevalent in other cohort studies and large-scale meta-analyses. In our study, ADSP and AMP-AD data include 432 identical subjects, corresponding to 38.12% of AMP-AD data. Also, ADSP and ADNI have 21 overlapping subjects (Fig. 1A(i)). As genetic studies are conducted across multiple centers, the odds of having duplicated subjects are high among different initiatives. Therefore, the planning of PRS studies requires considerable attention to exclude identical persons.

Several existing studies support the relationship between the independence of datasets and the possibility of the overestimation bias. Concretely, we observe the trend in which PRS performs significantly lower when data independence is explicitly controlled. For instance, when overlapping subjects are removed, PRS contributed less than 2% accuracy even with large discovery sets from a national biobank [12, 17]. In a large-scale Finnish study for coronary artery disease (CAD), arterial fibrillation (AF), type 2 diabetes mellitus (T2DM), breast cancer (BrC) and prostate cancer (PrC), the independencies in CAD and AF were clarified [12]. The PRS contributions of the two diseases were statistically insignificant (0.3% and 0.9%, respectively). In contrast, the PRSs of T2DM, BrC, and PrC used the summary statistics derived from large meta-analyses which included the finnish population and the contributions of PRS were significantly high (2%, 3.9%, and 2.9%, respectively).

We propose two potential signs when the overestimation bias should be suspected. The first is when a PRS model achieves surprisingly high performance without sufficient participants. The number of subjects and SNP heritability are important factors in PRS performance [19]. For example, hundreds of thousands of subjects in the discovery set would be required for PRS to be used in disease prediction [19]. Computing the PRS using UK Biobank for traits with prominent heritability sets the upper bounds for other traits with lower heritability. Thus, we reveal that a test set of 10k is required in the 60k discovery set for statistically significant AUC increment by PRS. In the studies registered in the PGS catalog, the median number of subjects in the test sets is 6995 and 24,573 in the discovery sets [23]. In addition, a systematic review of AD PRS studies reveals that the test set size ranges from 59 to 116,666 [24]. About one-third of them, the sample size is less than a thousand. Therefore, an abundance of prior sPRS studies [24,25,26] may not suffice as the number of samples required for statistically significant results. ​.

Second, for particular phenotypes, high variability in performances across different nations or races using the same discovery set may be another sign of overestimation. PRS performances plummet if the discovery and test sets are from different countries or ethnicities [12,13,14,15,16]. However, a line of evidence suggests that trans-ethnic portability remains in many traits [27,28,29], even if the inherent differences in LD structures across races prevent causal variants from being correctly reflected in PRS [30]. For instance, Martin et al. reported equivalent performances within the confidence interval for intra- and trans-ethnic test sets in four out of five binary traits and 11 out of 17 non-binary traits of BioBank Japan [27]. Also, sPRS developed with Europeans showed a 50% discounted performance for East Asians and 25% for Africans [27]. The gap (or the variability) between intra-ethnic and trans-ethnic evaluations of PRS can be falsely increased by overestimation in a specific ethinic group study. The low trans-ethnic portability of PRS can be understood and overcomed only after excluding overestimation bias. That is, if the performance for a trans-ethnic application is preserved at less than 25% for intra-ethnic evaluation, overlapping bias might be suspected. In other words, sPRS suffers in trans-nation or trans-ethnic studies since the subject-level independence strictly holds.

PRS also could be developed using GWA (P < 5 × 10–8) SNPs in core genes curated from multiple GWA studies [31,32,33], which we argue are not free from the overestimation bias. Using GWA SNPs is justified because it substitutes the P-value thresholding step required in conventional PRS studies for selecting SNPs. Also, GWA SNPs tend to show highly significant P-values and therefore are regarded as reliable. Moreover, GWA SNPs information can be conveniently accessible via reviewing prior works, even if the authors do not release summary statistics. However, GWA SNPs are frequently found in the uninterpretable non-coding regions [34], and PRS performances increase with a higher number of SNPs then plateau (Figs. 2 and 3) [35]. Therefore, PRS models from GWA SNPs may overfit to the discovery set. Hence, the odds of overestimation bias are high when the GWA SNPs are selected from multiple studies.

For the non-binary trait height, PRS has a greater contribution than binary trait hypertension. Similar trends have been observed in a prior study [27]. Heritability for height and hypertension was reported as 49.7% and 14.7%, respectively [21]. Although a superior heritability of height partially explains the performance gap, characteristics of each phenotype may also play a role. For instance, while measuring height is straightforward, a diagnosis of hypertension can depend on age. Thus, a subset of the control group can later be diagnosed with hypertension.

One limitation in our study is the indirect derivation of the scale effect to prove the overestimation bias of sPRS in AD prediction. We justify the choice of our methods based on three reasons. First, hypertension and height have higher heritabilities than AD [20, 21, 31]. Second, the number of SNPs used for UK Biobank is larger than those used for AD data analysis. Finally, the number of available subjects in UK Biobank is five-fold larger than IGAP. Therefore, we argue that PRS performances of hypertension from UK Biobank are sufficient upper bounds of PRS studies not only for AD prediction but also for those of most binary complex genetic traits/diseases using subjects of less than national biobank scale.

Conclusion

As the risk of overestimation bias is evaluated in sPRS studies, care must be taken to prevent overlap, especially within the same ethnicity. Direct methods of calculating sample overlap are not always feasible, so indirect methods using summary statistics can be applied [36]. While applications of genetic studies continue to gain momentum and many countries create large-scale biobanks, PRS developed from large meta-analyses that curate and merge data from several countries should be screened in advance to filter out overlapping subjects. Researchers often release summary statistics to help further research [23]. When using summary statistics direct comparisons between a test set and large-scale data are difficult. As such, we showcase both direct and indirect methods to probe overestimation bias within the same ethnicity—either of which, we argue, must be mandatory to improve PRS reliability.

Methods

Participants

In this work, AD genetic studies–International Genomics of Alzheimer’s Project (IGAP), Alzheimer’s Disease Sequencing Project (ADSP), and AMP-AD–are used to demonstrate overestimation bias in PRS [37,38,39,40,41,42,43]. Non-Hispanic white individuals are used, and their cross-study genetic relatedness is revealed by principal component (PC) and identity-by-state analyses. After quality control, our final analyses include 10,293 participants from ADSP and 1,133 from AMP-AD (Supplementary Table 1). In the UK Biobank database [44], 342,318 white-British participants had hypertension and height records. Refer to the Supplementary Material for additional information about the study datasets, sequencing methods, and quality control processes.

Statistical analyses

We perform logistic regressions for binary traits and linear regressions for a continuous phenotype using PLINK (v1.9) [45]. Three different regression models are constructed. First, a simple regression model is used with PRS as the only covariate. Second, Model II denotes a multivariable regression without PRS, consisting of additional covariates highly related to the phenotypes. In Model III, we introduce PRS as an additional covariate to Model II. In both models, we control for 20 leading PCs

$$\text{M}\text{o}\text{d}\text{e}\text{l} \text{I}: \text{y}={\beta }{\text{x}}_{\text{P}\text{R}\text{S}}$$
$$\text{M}\text{o}\text{d}\text{e}\text{l} \text{I}\text{I}: \text{y}={\text{x}}_{\text{c}\text{o}}^{\text{T}}{\text{a}}_{\text{c}\text{o}}+ {\text{x}}_{\text{P}\text{C}}^{\text{T}}{\text{b}}_{\text{P}\text{C}}$$
$$\text{M}\text{o}\text{d}\text{e}\text{l} \text{I}\text{I}\text{I}: \text{y}={\text{x}}_{\text{c}\text{o}}^{\text{T}}{\text{a}}_{\text{c}\text{o}}+ {\text{x}}_{\text{P}\text{C}}^{\text{T}}{\text{b}}_{\text{P}\text{C}}+{\beta }{\text{x}}_{\text{P}\text{R}\text{S}}$$
$${\text{x}}_{\text{c}\text{o}}=\left(\begin{array}{c}1\\ {\text{x}}_{\text{c}\text{o},1}\\ \begin{array}{c}?\\ {\text{x}}_{\text{c}\text{o},{\text{n}}_{1}}\end{array}\end{array}\right), { \text{a}}_{\text{c}\text{o}}= \left(\begin{array}{c}{\text{a}}_{0}\\ {\text{a}}_{1}\\ \begin{array}{c}?\\ {\text{a}}_{{\text{n}}_{1}}\end{array}\end{array}\right), { \text{x}}_{\text{P}\text{C}}= \left(\begin{array}{c}{\text{x}}_{\text{P}\text{C},1}\\ {\text{x}}_{\text{P}\text{C},2}\\ ?\\ {\text{x}}_{\text{P}\text{C},{\text{n}}_{2}}\end{array}\right), { \text{b}}_{\text{P}\text{C}}= \left(\begin{array}{c}{\text{b}}_{1}\\ {\text{b}}_{2}\\ ?\\ {\text{b}}_{{\text{n}}_{2}}\end{array}\right)$$

,

Where \({\text{x}}_{\text{c}\text{o}}, {\text{a}}_{\text{c}\text{o}}\in {\mathbb{R}}^{{\text{n}}_{1}+1}\) are vectors of general covariates (e.g., age and sex) and corresponding coefficients, respectively, while \({\text{x}}_{\text{P}\text{C}},{\text{b}}_{\text{P}\text{C}}\in {\mathbb{R}}^{{\text{n}}_{2}}\) are vectors of PCs and PC coefficients, respectively. Here, \({\text{x}}_{\text{P}\text{R}\text{S}}\) denotes the PRS term. Throughout the manuscript, we focus on measuring the additive gain of PRS in Model III, on top of Model II.

For AD datasets, common covariates include sex, APOE ε4 status, and the sequencing centers. PCs are computed using the principal component analysis function of PLINK (v1.9) [45]. For UK Biobank, age, sex, and array types (UK Biobank Axiom array or UK BiLEVE Axiom array) are considered as covariates. Here, we download 40 PCs pre-calculated with fastPCA [46]. For hypertension of UK Biobank, body mass index is additionally included in the covariates. For binary traits, the areas under receiver operating characteristic (AUC) and Nagelkerke’s pseudo-R2 are used to assess model performances, which are calculated using “pROC” and “fsmb” packages of R (v4.0.3), respectively [47, 48]. Performance improvements from PRS are determined by subtracting AUC and R2 of Model II from those of Model III, which we label as ΔAUC and ΔR2. The statistical significance of ΔAUC is examined using DeLong’s test [49]. For height, a non-binary trait, we compute the adjusted-R2 via “lm” in the R program. PRS contributions are determined by comparing Model II and Model III with the extra sum of squares test.

Cross-validation

For ten-fold cross-validation tests, we balance the number of samples between the discovery and test splits based on each covariate in the statistical analyses using “StratifiedKFold” function from Python’s (v3.8) “scikit-learn” (v0.24.1) package [50]. When testing a part of the cross-validated datasets, samples are balanced over covariates using the R (v4.0.3) “sampling” (v2.9) package [51].

Computation of PRS

For computing PRS, we select common (MAF ≥ 1%) SNPs and use summary statistics from discovery sets, followed by measuring PRS in test datasets. For rPRS, we calculated summary statistics by logistic regression. For sPRS, we downloaded summary statistics from IGAP web page (https://www.niagads.org/datasets/ng00036). After selecting SNPs with P < 0.5 in the association tests using the discovery dataset, we perform clumping with the window of ± 1Mbp and r2 < 0.1. Clumping is performed using PLINK (v1.9) [45]. For AD genetic studies, we exclude any SNPs within 1Mbp of the APOE (apolipoprotein E) region. When analyzing the effect of the number of SNPs on the results, the SNPs are selected in the order from the lowest P-value. We construct PRS with PRSice (v2.3) and Lassosum (v0.4.5) [2, 22].