Introduction

Gaining insight into the molecular determinants of differences in cognitive abilities is one of the core aims of neurobiological intelligence research. Our ability to “understand complex ideas, to adapt effectively to the environment, to learn from experience, to engage in various forms of reasoning [and] to overcome obstacles by taking thought” has usually been described as general intelligence [1]. Various tests have been designed to measure the cognitive abilities of a person by assessing different aspects of general intelligence, like inductive and deductive reasoning abilities or the amount of acquired declarative knowledge [2]. Whereas most test procedures cover different aspects of general intelligence, there are also tests that focus on specific cognitive abilities. Matrix reasoning tests, for instance, are typically used to assess non-verbal abstract reasoning [3, 4], while other tests measure stored long-term memory of static information like rules, relationships, abstract concepts, and of course, general knowledge [2, 5].

Decades of intelligence research have shown that general intelligence is one of the best predictors of important life outcomes, including educational and occupational success [6, 7] as well as mental and physical health [8,9,10]. Thus, considerable research efforts have been put into place to explore the mechanisms behind interindividual differences in general intelligence. Behavioral genetics has been particularly fruitful for intelligence research. Twin and family studies have demonstrated that general intelligence is one of the most heritable behavioral traits, with heritability estimates ranging from 60 to 80% in adulthood [for an overview see 11, 12]. However, these traditional quantitative genetics studies cannot be used to estimate which and how many genetic variants contribute to heritability. Although as early as 1918, Fisher’s infinitesimal model postulated that complex traits are affected by a large number of genes, it was not until the advent of genome-wide association studies (GWAS) that effect sizes of single nucleotide polymorphisms (SNPs) could be systematically assessed over the genome. In support of Fisher’s infinitesimal model [13], large GWAS on intelligence have shown that even the most strongly associated SNPs explain less than 1% of the variance, and that heritability of intelligence is caused by a very large number of DNA variants of small effect (not taking into account rare mutations with potentially large effects on individuals, but small effects on the population) [14].

Given that GWAS of sufficient sample size can reliably detect very small effects of single common variants, and given that SNPs contribute cumulatively to heritability, a fruitful approach forward has been the use of so-called polygenic scores (PGS). These are genetic indices of a trait, defined as the sum of trait-associated alleles across many genetic loci, weighted by effect sizes estimated by GWAS. Such scores can be calculated for individuals in target samples (independent from the initial discovery GWAS) and be used to predict traits of interest [15]. For instance, PGS for intelligence (IQ-PGS) [16, 17] and cognitive performance (CP-PGS), as well as educational attainment (EA-PGS) [18,19,20], a secondary measure of intelligence, have been associated with a wide variety of traits, including life-course development, educational achievement, body mass index, or emotional and behavioral problems in children [21,22,23]. Although IQ-PGS, CP-PGS, and EA-PGS explain a considerable amount of variance in intelligence (which is thought to increase even further with larger GWAS) [14], and robust and sometimes unexpected associations between genetic indices of cognitive abilities and other traits have been uncovered, it is important to understand that the predictive power of these PGS depends on the cognitive measure that is being used. To reliably identify genetic variants associated with a complex continuous behavioral trait, such as intelligence, in a GWAS, large sample sizes in the 100,000 s to millions are required. This has been successfully achieved using a light-phenotyping approach, that is, performing GWAS on the performance in rather superficial tests of general cognitive abilities [24, 25], or even more crudely, years of education [25]. The question thus arises, which of the various aspects of general intelligence [1, 2] are mainly reflected in those GWAS. The study at hand aimed to tackle this issue by pursuing a deep phenotyping approach. Using an extensive test battery comprised of tests for memory performance, processing speed, reasoning, and general knowledge, we investigated the predictive power of IQ-PGS [24], CP-PGS, and EA-PGS [25] with regard to each of the aforementioned cognitive abilities.

Methods

Sample Size Estimation

A statistical power analysis was performed for sample size estimation based on data reported by Savage et al. [24] and Lee et al. [25]. Both meta-analyses reported effect sizes [26] of R2 = 0.052 (IQ-PSG) [24], and R2 = 0.097 (EA-PGS) [25], respectively. G-power [27] was used to determine the sample size required to detect a small to medium effect size (f2 = 0.08) in a multiple linear regression analysis, using an α error of 0.05 and statistical power of 1-β = 0.90. Sample size estimation predicted that N = 236 participants were needed to obtain the desired statistical power. With a final sample size of N = 518 (see below), our sample was thus adequately powered for the main objective of this study.

Participants

We investigated 557 neurologically and psychologically healthy participants with a mean age of 27.33 years (SD = 9.43; range 18–75 years), including 283 males (mean age: 27.71 years, SD = 9.86 years) and 274 females (mean age: 26.94 years, SD = 8.96 years). The sample was mainly comprised of university students of different majors (mean years of education: 17.14 years, SD = 3.12 years), who either received a financial reward or course credits for their participation. Health status was self-reported by the participants as part of the demographic questionnaire. Individuals who reported current or past neurological or psychological problems were not admitted to the study. The study protocol was approved by the local ethics committee of the Faculty of Psychology at Ruhr University Bochum (vote Nr. 165). All participants gave written informed consent and were treated following the Declaration of Helsinki.

Acquisition and Analysis of Behavioral Data

Behavioral data was acquired during four individual test sessions. Each session was designed as a group setting of up to six participants, seated at individual tables, in a quiet and well-lit room. The tests were administered according to their respective manuals. The following is a brief description of each test procedure used in our study. Please refer to the Supplementary Material for descriptive statistics (Supplementary Material S1) and intercorrelations (Supplementary Material S2) of all cognitive tests.

I-S-T 2000 R

The Intelligenz-Struktur-Test 2000 R (I-S-T 2000 R) is a well-established German intelligence test battery measuring multiple facets of general intelligence [28, 29]. The test consists of a basic and an extension module. The basic module measures different aspects of intelligence and contains 180 items assessing verbal, numerical, and figural abilities as well as 23 items assessing verbal and figural memory. Verbal, numeric, and figural abilities are measured by three reasoning tasks that comprise 20 items each. For instance, verbal intelligence is assed via items on sentences completion, where the participant is asked to complete a sentence with the correct word, or on analogies and commonalities. Numerical intelligence on the contrary comprises items on arithmetic problems, digit spans, and arithmetic operators assessing the mathematical abilities of the participant. Figural intelligence is assessed via items testing for the participant’s ability to assemble figures mentally, to mentally rotate and match dices, and to solve matrix-reasoning tasks. The processing time for this section is about 90 min. Subsequently, the participants complete two memory tasks, one verbal (10 items) and one figural (13 items), where they must memorize a series of words, or pairs of figures, respectively. This takes about 10 min. The extension module measures general knowledge covering a total number of 84 items. The knowledge test covers verbal (26 questions), numerical (25 questions), and figural knowledge (22 questions) and takes about 40 min. Here, the participant’s knowledge on various facets is assessed: art/literature, geography/history, mathematics, science, and daily life. Most of the items of both modules are designed in multiple-choice form. The only exception is two sub-tests on numerical reasoning (calculations and number series). Here, the participant has to directly fill in the answer. The complete testing session takes about 2 h 30 min. The reliability estimates (Cronbach’s α) for the sub-factettes of the basic module fall between 0.88 and 0.95, as well as 0.93 for the memory tasks and 0.96 for general mental ability. The extension module has a reliability of 0.93 (Cronbach’s α). The recent norming sample consists of about 5800 individuals for the basic module and 661 individuals for the extension module. The age range in the norming sample is between 15 and 60 years and both sexes are represented equally.

ZVT

The Zahlenverbindungstest (ZVT) is a trail-making test used to assess processing speed in both children and adults [30]. After editing two sample matrices, the participant has to process a total of four matrices. Here, the individual has to connect circled numbers from 1 to 90 in ascending order. The numbers were positioned more or less randomly within the matrix. The instructor measures the processing time for each matrix. The total test value, reflecting the participant’s processing speed, is then calculated as the arithmetic mean of all four matrices. The test takes about 10 min in total. The reliability between the individual matrices is above 0.86 and 0.95 for adults and 6-month retest-reliability is between 0.84 and 0.90. The recent norming sample consists of about 2109 individuals with an age range between 8 and 60 years and equal sex representation.

BOMAT-Advanced Short

The Bochumer Matrizentest (BOMAT) is a non-verbal intelligence test which is widely used in neuroscientific research [3, 31,32,33]. Its structure is similar to the well-established Raven’s advanced progressive matrices [4]. Within the framework of our study, we carried out the advanced short version of the BOMAT, which is known to have high discriminatory power in samples with generally high intellectual abilities, thus avoiding possible ceiling effects [32, 33]. The test comprises two parallel forms with 29 matrix-reasoning items each. Participants were assigned to one of the two forms, which they had to complete. The participants have a total of 45 min to process as many matrices as possible. The split-half reliability of the BOMAT is 0.89, Cronbach’s α is 0.92, and reliability between the parallel forms is 0.86. The recent norming sample consists of about 2100 individuals with an age range between 18 and 60 years and equal sex representation.

BOWIT

The Bochumer Wissenstest (BOWIT) is a German inventory to assess the subject’s degree of general knowledge [5]. The inventory comprises two parallel forms with 154 items each. Both forms include eleven different facets of general knowledge: arts/architecture, language/literature, geography/logistics, philosophy/religion, history/archeology, economics/law, civics/politics, biology/chemistry, nutrition/health, mathematics/physics, and technology/electronics. Within one test form, each knowledge facet is represented by 14 multiple-choice items. In our study, all participants had to complete both test forms resulting in a total number of 308 items. The processing time for each test form is about 45 min. The knowledge facets assessed by the BOWIT are very similar to general knowledge inventories used in other studies [34,35,36]. The inventory fulfills all important quality criteria regarding different measures of reliability and validity. The inventory’s manual specifies that split-half reliability is 0.96, Cronbach’s α is 0.95, test–retest reliability is 0.96, and parallel-form reliability is 0.91. Convergent and discriminant validity are given for both test forms. The norming sample consists of about 2300 individuals (age range: 18–66 years) and has an equal sex representation.

DNA Sampling and Genotyping

For non-invasive sampling, exfoliated cells were brushed from the oral mucosa of the participants. DNA isolation was performed with QIAamp DNA mini Kit (Qiagen GmbH, Hilden, Germany). Genotyping was carried out using the Illumina Infinium Global Screening Array 1.0 with MDD and Psych content (Illumina, San Diego, CA, USA) at the Life & Brain facilities, Bonn, Germany. Filtering was performed with PLINK 1.9 [37, 38] removing SNPs with a minor allele frequency of < 0.01, deviating from Hardy–Weinberg equilibrium with a p value of < 1*10–6, and missing data > 0.02. Participants were excluded with > 0.02 missingness, sex-mismatch, and heterozygosity rate >|0.2|. Filtering for relatedness and population structure was carried out on a SNP set of filtered for high quality (HWE p > 0.02, MAF > 0.2, missingness = 0), and LD pruning (r2 = 0.1). In pairs of cryptically related subjects, pi hat > 0.2 was applied to excluded subjects at random. Principal components to control for population stratification were generated, and outliers >|6SD| on one of the first 20 PC were excluded. The final data set consisted of 518 participants and 494,740 SNPs.

Polygenic Scores

We created genome-wide polygenic scores for each participant using publicly available summary statistics for intelligence (N = 269,867), cognitive performance (N = 257,828), and educational attainment (excl. 23andMe; N = 766,345) [24, 25]. Polygenic scores were constructed as the weighted sums of each participant’s trait-associated alleles across all SNPs using PRSice 2.1.6 [39]. In regard to the highly polygenic nature identified for EA and IQ and the observed range of highest prediction of the PGS in the original manuscripts [24, 25], we applied a p value threshold (PT) of 0.05 for the inclusion of SNPs in the calculation of IQ-PGS, CP-PGS, and EA-PGS. Additionally, we report the results for the PGS with the strongest association with the respective cognitive tests (and subtests) in our sample (best-fit PGS). That is, the p value threshold (PT) for inclusion of SNPs was chosen empirically (for the range of PT 5*10–8 − 0.5 in steps of 5*10–5), so the resulting PGS explained the maximum amount of test score variance for the respective measure in our sample. Finally, we also investigated the predictive power of PGS including all available SNPs (non-fit PGS), that is, the p value threshold for SNP inclusion equaled PT = 1.00. The predictive power of the PGS derived from the GWAS was measured by the “incremental R2” statistic [25]. The incremental R2 reflects the increase in the determination coefficient (R2) when the PGS is added to a regression model predicting the behavioral phenotype alongside a number of control variables (here: sex, age, and the first four principal components of population stratification). For all statistical analyses in PRSice, linear parametric methods were used. Testing was two-tailed with an α-level of p < 0.05. As we report a total of 81 regression analyses, our results were FDR corrected for multiple comparisons and the corrected α-levels were in the range between 0.05/81 = 0.000617 and 0.05 as defined by the Benjamini–Hochberg method [40]. Since the sexes differed significantly for some of the phenotypes (see results), we also calculated the abovementioned analyses in an exploratory fashion separately for the sexes. Control variables were age, and the first four principal components of population stratification. For PT = 0.05 and PT = 1, we applied the same procedure as above. For the best-fit approach, we chose the full sample best-fit PT of the respective phenotype and applied it to the subsamples. Finally, we test whether the predictive power of PGS differentiates between females and males by comparing the incremental R2 between the two groups correcting for multiple comparisons using the Benjamini–Hochberg method as described above.

Results

In the following section, we report incremental determination coefficients (incremental R2) for the PGS with a PT of 0.05 and respective test scores. Additionally, we investigated the association between our cognitive test scores and, so-called best-fit IQ-PGS, CP-PGS, and EA-PGS. These best-fit PGS were estimated by using a function, which empirically determines a p value threshold for SNP inclusion to maximally predict the performance in the respective cognitive test (see Supplementary Material S3, S4, and S5 for intercorrelations between best-fit PGS). Finally, we explored the predictive power of non-fit PGS including all available SNPs (PT = 1.00) and cognitive test scores.

For PT of 0.05, IQ-PGS was especially predictive of individual differences in verbal (incremental R2 = 3.29%, p < 0.001) and numerical intelligence (incremental R2 = 3.05%, p < 0.001) measured with the IST-2000-R. The predictive power for differences in general intelligence was only slightly lower (Fig. 1). Here, IQ-PGS had an incremental R2 of 2.71% (p < 0.001). Regarding measures of general knowledge, IQ-PGS had an incremental R2 of 1.85% for ability differences in the IST-2000-R general knowledge test (p < 0.001) and 1.34% for differences assessed with the BOWIT (p = 0.001). Differences in non-verbal aspects of intelligence were weaker predicted. Our multiple regression analyses resulted in predictive values of R2 = 1.04% for processing speed (p = 0.017), R2 = 0.89% for matrices (p = 0.029), and R2 = 0.13% for figural intelligence (p = 0.406). Also, the predictive power of IQ-PGS for memory was weak (incremental R2 = 0.80%, p = 0.033).

Fig. 1
figure 1

Incremental R2 of the p value threshold (PT) = 0.05 polygenic scores of intelligence (IQ-PGS), cognitive performance (CP-PGS), and educational attainment (EA-PGS) in percent. The incremental R2 reflects the increase in the determination coefficient (R2) when the IQ-PGS or CP-PGS or EA-PGS is added to a regression model predicting individual differences in the respective cognitive test. The association between PGS and phenotype was controlled for the effects of sex, age, population stratification, and multiple comparisons [40]. *Adjusted p ≤ 0.05, **adjusted p ≤ 0.01, ***adjusted p ≤ 0.001

Individual differences in CP-PGS were especially predictive of individual differences in verbal (incremental R2 = 4.64%, p < 0.001) and general intelligence (incremental R2 = 4.27%, p < 0.0021) measured with the IST-2000-R. The predictive power for differences in numerical intelligence was only slightly lower (Fig. 1). Here, CP-PGS had an incremental R2 of 3.24% (p < 0.001). Regarding measures of general knowledge, CP-PGS had an incremental R2 of 2.22% for ability differences in the IST-2000-R general knowledge test (p < 0.001) and 1.60% for differences assessed with the BOWIT (p < 0.001). Differences in language-free components of general intelligence, measured as figural intelligence (incremental R2 = 1.15%, p = 0.014) and matrices (incremental R2 = 1.47%, p = 005), were slightly weaker explained by CP-PGS (Fig. 1). Interestingly, CP-PGS had a high predictive power for processing speed (incremental R2 = 2.66%, p = 0.001). The least predictive power was detected for individual differences in memory performance (incremental R2 = 1.07%, p = 0.014).

Individual differences in EA-PGS were especially predictive of individual differences in general intelligence (incremental R2 = 2.06%, p < 0.001). IST-2000-R verbal intelligence (incremental R2 = 1.61%, p = 0.004) and numerical intelligence (incremental R2 = 1.27%, p = 0.007), as well as aspects of general knowledge assessed by IST-2000-R (incremental R2 = 1.39%, p = 0.002) and BOWIT (incremental R2 = 1.25%, p = 0.002), were moderately predicted by EA-PGS. (Fig. 1). Differences in language-free components of general intelligence, measured as figural intelligence (incremental R2 = 1.17%, p = 0.013) matrices (incremental R2 = 0.70%, p = 0.054) and processing speed (incremental R2 = 0.67%, p = 0.057), were partially only weakly explained by EA-PGS (Fig. 1). Again, the predictive power of EA-PGS in individual differences in memory performance was weak (incremental R2 = 0.67%, p = 0.051) (Fig. 1).

Apart from the p value threshold (PT) of 0.05, we also investigated the predictive power of PGS with the strongest association with the respective cognitive tests (and subtests) in our sample, so-called best-fit PGS.

Here, best-fit IQ-PGS were especially predictive of individual differences in general intelligence measured with the IST-2000-R (incremental R2 = 4.95%, p < 0.001). The predictive power for numerical intelligence and verbal intelligence was only slightly lower (Fig. 2). Here, IQ-PGS had an incremental R2 of 4.41% for numerical intelligence (p < 0.001) and 4.20% for differences in verbal intelligence (p < 0.001). Regarding measures of general knowledge, IQ-PGS had an incremental R2 of 2.38% for ability differences assessed with the IST-2000-R general knowledge test (p < 0.001) and 1.65% for differences in BOWIT (p < 0.001). Moreover, 2.00% of the difference in figural intelligence (p = 0.001), 2.04% of the performance difference in the BOMAT (p < 0.001), and 1.99% in processing speed (p < 0.001) were additionally explained by including IQ-PGS into the respective regression model. As depicted in Fig. 2, IQ-PGS had a low predictive power for memory (incremental R2 = 1.08%, p = 0.013).

Fig. 2
figure 2

Incremental R2 of the best-fit polygenic scores of intelligence (IQ-PGS), cognitive performance (CP-PGS), and educational attainment (EA-PGS) in percent. The p value thresholds (PT) that determined the inclusion of SNPs into the respective PGS are displayed in the respective bar. The incremental R2 reflects the increase in the determination coefficient (R2) when the IQ-PGS or CP-PGS or EA-PGS is added to a regression model predicting individual differences in the respective cognitive test. The association between PGS and phenotype was controlled for the effects of sex, age, population stratification, and multiple comparisons [40]. *Adjusted p ≤ 0.05, **adjusted p ≤ 0.01, ***adjusted p ≤ 0.001

Individual differences in CP-PGS were especially predictive of individual differences in verbal intelligence (incremental R2 = 5.63%, p < 0.001), general intelligence (incremental R2 = 5.47%, p < 0.001), and numerical intelligence (incremental R2 = 3.91%, p < 0.001) measured with the IST-2000-R (Fig. 2). Aspects of general knowledge assessed by IST-2000-R (incremental R2 = 3.00%, p < 0.001) and BOWIT (incremental R2 = 1.96%, p < 0.001) were moderately predicted by CP-PGS. Differences in language-free components of general intelligence, measured as figural intelligence (incremental R2 = 2.49%, p < 0.001) and matrices (incremental R2 = 2.47%, p < 0.001), were also moderately explained by CP-PSG (Fig. 2). Interestingly, CP-PGS had a higher predictive power for processing speed (incremental R2 = 3.02%, p < 0.001). The least predictive power was detected for individual differences in memory performance (incremental R2 = 1.57%, p = 0.003).

For EA-PGS, we especially found a high predictive power for individual differences in general intelligence (incremental R2 = 3.12%, p < 0.001) and general knowledge assessed by the IST-2000-R (incremental R2 = 3.05%, p < 0.001). Verbal intelligence (incremental R2 = 2.61%, p < 0.001) and numerical intelligence (incremental R2 = 2.48%, p < . 001), measured with the IST-2000-R (Fig. 2) and general knowledge assessed by BOWIT (incremental R2 = 2.33%, p < 0.001), were moderately predicted by EA-PGS. Differences in non-verbal aspects of intelligence were weaker predicted. Our multiple regression analyses resulted in predictive values of R2 = 1.76% for processing speed (p = 0.002), R2 = 1.66% for matrices (p = 0.003), and R2 = 1.64% for figural intelligence (p = 0.003). Again, the lowest predictive power was detected for individual differences in memory performance (incremental R2 = 1.38%, p = 0.005).

Next, we investigated the predictive power of PGS summarizing the effects of all SNPs (PT = 1.00, non-fit PGS, Fig. 3). Here, again, the increment in the determination coefficient caused by IQ-PGS was especially high for numerical (incremental R2 = 3.00%, p < 0.001), verbal (incremental R2 = 2.31%, p < 0.001), and general intelligence (incremental R2 = 1.94%, p = 0.001) measured with the IST-2000-R. Moreover, IQ-PGS increases the determination of the differences in general knowledge measured via IST-2000-R by 1.50% (p = 0.001) and by 1.21% (p = 0.002) for knowledge differences assessed by BOWIT. While differences in processing speed were predicted with an incremental R2 of 1.17% (p = 0.012), IQ-PGS had the lowest incremental effect on predicting differences in matrices (incremental R2 = 0.60%, p = 0.074), memory (incremental R2 = 0.71%, p = 0.045), and figural intelligence (incremental R2 = 0.00%, p = 0.982).

Fig. 3
figure 3

Incremental R2 of the non-fit polygenic scores of intelligence (IQ-PGS), cognitive performance (CP-PGS), and educational attainment (EA-PGS) in percent. p value threshold (PT) = 1. The incremental R2 reflects the increase in the determination coefficient (R2) when the IQ-PGS or CP-PGS or EA-PGS is added to a regression model predicting individual differences in the respective cognitive test. The association between PGS and phenotype was controlled for the effects of age, sex, population stratification, and multiple comparisons [40]. *Adjusted p ≤ 0.05, **adjusted p ≤ 0.01, ***adjusted p ≤ 0.001

A similar picture emerges for the predictive power of CP-PGS. The highest predictive power occurred for measures of verbal intelligence (incremental R2 = 3.64%, p < 0.001), general intelligence (incremental R2 = 2.59%, p < 0.001), and numerical intelligence (incremental R2 = 2.06%, p < 0.001). The predictive power for differences in general knowledge assessed by IST-2000-R (incremental R2 = 1.99%, p < 0.001) and BOWIT (incremental R2 = 1.41%, p < 0.001) but also processing speed (incremental R2 = 1.93%, p = 0.001) was only slightly lower (Fig. 3). Measures of non-verbal intelligence and memory were only poorly predicted by CP-PGS (matrices: R2 = 0.99%, p = 0.022; figural intelligence: R2 = 0.38%, p = 0.158; memory: R2 = 0.65%, p = 0.054).

Individual differences in EA-PGS were especially predictive of individual differences in general intelligence (incremental R2 = 3.12%, p < 0.001), general knowledge assessed by IST-2000-R (incremental R2 = 3.05%, p < 0.001). General knowledge defined by BOWIT (incremental R2 = 2.33%, p < 0.001) and verbal (incremental R2 = 2.61%, p < 0.001) and numerical intelligence (incremental R2 = 2.48%, p < 0.001) were moderately predicted by EA-PGS. As shown in Fig. 3, measures of non-verbal intelligence and memory were only poorly predicted by IQ-PGS (processing speed: R2 = 1.36%, p = 0.006; matrices: R2 = 1.65%, p = 0.004; figural intelligence: R2 = 1.09%, p = 0.017; memory: R2 = 1.38%, p = 0.005).

Overall, for the associations between cognitive measures and PT = 0.05, PT = 1.00 or best-fit PGS, a comparable pattern emerged, albeit the magnitude of explained variance deviated. Here, in general, the predictive power of the PT = 0.05 and PT = 1.00 was somewhat lower than for the best-fit PGS. Moreover, the amount of explained variance was, in most cases, higher for IQ-PGS and CP-PGS than for EA-PGS (Figs. 1, 2, and 3).

Finally, we tested whether there are sex-differences in the cognitive test scores and the predictive power of respective PGS. We did not observe a significant sex difference with regard to age (t(555) = 0.96, p = 0.304) and verbal intelligence (t(555) = 1.49, p = 0.14), processing speed (t(555) = 0.85, p = 0.39), and figural intelligence (t(555) = 1.32, p = 0.19). However, we found that females scored significantly higher on memory performance (t(555) =  − 4.87, p < 0.001), while males achieved significantly higher scores for general intelligence (t(555) = 5.01, p < 0.001), matrices (t(555) = 2.21, p = 0.03), numerical intelligence (t(555) = 7.65, p < 0.001), and general knowledge assessed by IST-2000-R (t(555) = 10.42, p < 0.001) and BOWIT (t(555) = 9.63, p < 0.001). Given the substantial sex differences for most of the cognitive test scores, we decided to compute the aforementioned predictive power of the respective PGS for both sexes separately (Fig. S1 to S3). Although the pattern of results seems to be different for females and males, 78 out of 81 incremental R2 comparisons between the sexes were not significant (p > . 05). Statistical analysis indicated a significant sex-difference in incremental R2 of EA-PGS for matrices (p < 0.05) (PT = 0.05 EA-PGS; p = 0.04, best-fit EA-PGS; p = 0.0037, EA-PGS PT = 1.00; p = 0.0039) in the first place; however, none of the p values survived the control for multiple comparisons as defined by the Benjamini–Hochberg method.

Discussion

Polygenic scores for intelligence, cognitive performance, and educational attainment are increasingly used to investigate associations between genetic disposition for cognitive abilities and different life outcomes [21,22,23,24,25]. It is, however, currently not known which aspects of general intelligence are reflected to what extent by available PGS, derived from large GWAS. Here, we show that IQ-PGS, CP-PGS, and EA-PGS do not predict every form of cognitive ability equally well (Figs. 1, 2, and 3). Specifically, we found for all three PGS for the whole group and separated for the sexes the pattern that all PGS had a high predictive power for interindividual differences in general, verbal, and numerical intelligence. In contrast, memory was only weakly associated. The pattern for matrices was not consistent. For the whole group, as well as for males, we found a poor association for matrices, whereas for females, we found a higher predictive power.

Previous findings investigated the predictive power of single PGS in cognitive abilities. Liu et al. [41] used EA-PGS to predict verbal and matrix reasoning, and a recent preprint by Loughnan et al. [42] investigated the predictive power of IQ-PGS on different cognitive facets in children. Although a direct comparison between these studies and our results is not straightforward since we report incremental R2 for the predictive power and these studies used the standardized regression coefficient beta, by and large, our findings are in accordance with the results reported in these studies. More specifically, PGS estimated from cognitive ability approximations, like intelligence, cognitive performance, and educational attainment, are more strongly associated with crystallized cognitive abilities compared to fluid abilities. This finding is counterintuitive at the first glance, since previous classical models [43] assumed that crystallized abilities would be less influenced by genetics but more impacted by environmental factors, like education. Nevertheless, recent evidence from a meta-analytic twin study showed higher heritability estimates for crystallized compared to fluid abilities [44]. Here, Kan et al. [44] speculated that these findings could be explained in terms of genotype-environment covariance. Because the acquisition of crystallized abilities (e.g., knowledge) depends on fluid abilities (e.g., cognitive processing) as described by the investment hypothesis [43], individuals who develop relatively high levels of cognitive-processing abilities tend to achieve relatively high levels of knowledge. More specifically, high achievers are more likely to end up in cognitively demanding environments which facilitate the initial genetic predisposition and thus the further development of a wide range of knowledge and skills. If, in addition, stimulating environments foster societally valued knowledge and skills more than cognitive processing per se, data simulations with dynamical models indicate that heritability coefficients of crystallized abilities could exceed those of fluid abilities [44]. Although this assumption seems plausible, we assume an alternative and more parsimonious explanation for these findings. We suggest that the predictive power of PGS in target samples could be influenced by the design of the GWAS that were used to discover the PGS in the first place. In typical GWAS, sample sizes tend to be very large, which usually comes at the cost of light phenotyping. It follows that PGS, which summarize trait-associated effect sizes of single SNPs, will reflect the genetic basis of the measured phenotype. In the case of educational attainment, this was defined as the years of schooling an individual completed [25]. Because verbal and numerical intelligence reflect culturally acquired abilities [2], it is not surprising that PGS based on individual differences in years of education are primarily associated with individual differences in these aspects of intelligence, and less so with differences in non-verbal intelligence and memory. Of note, a recent analysis indicates that non-cognitive genetic factors, i.e., genetic variation in educational outcomes not explained by genetic variation in cognitive ability, accounted for more than half of the genetic variance in EA, and that heritable non-cognitive skills influence personality characteristics, and downstream health outcomes [45].

Interestingly, the same applies to the IQ-PGS and CP-PGS. These scores were based on a GWAS meta-analysis that was mainly driven by the UK Biobank subsample. With almost 200,000 participants, the UK Biobank sample contributed at least above two-thirds of the meta-analysis sample [25]. In the UK Biobank, cognitive abilities were measured as the number of correct answers to a total of 13 questions assessing both verbal and mathematical intelligence. Although the respective questions are considered to measure intelligence in the form of verbal and numerical abilities, the number of correct answers seems to be dependent on culturally acquired knowledge rather than on deductive and inductive reasoning [46]. This becomes clear when taking a closer look at individual items. For example, arithmetic capability is measured with the following item: “If David is twenty-one and Owen is nineteen and Daniel is nine years younger than David, what is half their age combined?” [46]. As this is a classic word problem, it is not language-free and, therefore, clearly affected by cultural expertise. Again, the selection of test procedures used to determine individual differences in cognitive ability in the discovery sample appears to have a distorting influence on the predictive power of the PGS in other target samples. This is particularly important when one wants to investigate the functional pathways between genotype and phenotype. Researchers who aim to bridge the gap between genetic and neuronal correlates of cognitive abilities should be aware of this issue as previous studies have shown that distinct aspects of intelligence have distinct neural correlates [33]. Although our study provides important insights into the predictive power of commonly used PGS, the following limitations need to be considered. First, it has to be noted that with the present design, we cannot distinguish between direct genetic effects on cognitive performance and indirect, environmentally mediated parental genetic effects in form of genotype-environment correlations (e.g., cognitively enriching environments provided by parents [47]). Study designs combining family data with genotypic data will be needed to further dissect those contributions [48]. Second, in the present study, the training data was primarily of European ancestry, and the target sample was a homogenous sample largely consisting of German university students. While polygenic prediction has been demonstrated to work best for discovery and target samples of matching ethnic background [42, 49], as in the present study, the availability of large-scale discovery samples for other ancestries, e.g., of African or Asian populations, would be indispensable for the comparative analysis of polygenic contribution to cognition in different ancestries and cultures. Third, our sample was mainly composed of university students with a restricted age range. Previous studies show that our results are in accordance to associations patterns for various cognitive domains in children [42]. However, independent replication of our results in diverse samples with similar age ranges but different educational backgrounds is highly desirable [50].

Forth, the distinctive feature of our study is that we report the predictive power of various PGS separately for the sexes. Our results show that for both sexes, all three PGSs have higher predictive power for general, verbal, and numerical intelligence lower for memory. The pattern for matrices was not consistent. For the whole group, as well for males, we found a poor association, whereas for females, we found a very high predictive power especially for EA-PGS. However, one has to be careful with these interpretations, because the patterns of both sexes and the comparison do not differ significantly from each other when controlling for multiple comparisons. Since the analyses for both sexes are based on an exploratory analysis and were performed with significantly smaller sample sizes, it is not surprising that they suffer from reduced statistical power compared to the whole group analyses. Therefore, we encourage future studies to systematically investigate sex differences in terms of the predictive power of various PGS in their cohorts.

Fifth, for the selection of a suitable p value threshold for SNP inclusion, we were guided by the original publications of Savage et al. [24] and Lee et al. [25]. However, there is no clear specification as to which p value threshold achieves the best overall predictive power regarding cognitive ability differences. Therefore, we additionally used a best-fit approach, which yielded different p value thresholds for SNP inclusion for the different cognitive tests (Fig. 2). Here, p value thresholds for the various PGS ranged from PT = 0.0001 to PT = 1 but were highly intercorrelated (see Supplementary Material S3, S4, S5) so that the different best-fit scores are highly comparable regarding their SNP composition. As using a best-fit approach can lead to an overestimation of the observed explained variance [39], we also report the associations of non-fit PGS with a p value threshold of PT = 1.00 (Fig. 3). With this approach, PGS that take all available SNPs for either intelligence, cognitive performance, or educational attainment into account were tested for associations with the different cognitive tests. Although PT = 0.05 and PT = 1.00 have a consistently lower predictive power than the best-fit PGS, they exhibit a comparable pattern of results. This confirms that polygenic scores of intelligence, cognitive performance, and educational attainment derived from large GWAS [24, 25] reflect some aspects of intelligence, e.g., verbal and numerical intelligence, more accurately than others, like memory. In conclusion, our study is the first which systematically investigates the association between polygenic scores of intelligence [24], cognitive performance, and educational attainment [25] and a wide range of general intelligence aspects in healthy adults. The study shows how large-scale light-phenotyping GWAS studies with a strong statistical power to identify associated variants, and smaller comprehensive deep-phenotyping approaches with a fine-grained assessment of the phenotype of interest can complement each other. We demonstrate that the discovery GWAS for intelligence, cognitive performance, and educational attainment do not reflect every form of cognitive ability equally well. Realizing that the way of phenotyping in large GWAS affects the predictive power of the resulting PGS [see also 51] is essential for all future studies planning to use PGS to unravel the genetic correlates of cognitive abilities in their samples.