In the last 15 years, genome-wide association studies (GWASs) have identified tens of thousands of associations between genetic variants and a range of human behavioral and physical traits. One gene that has popped up surprisingly often in behavioral GWASs is the cell adhesion molecule 2 gene (CADM2). Common variations (single nucleotide polymorphisms, SNPs) in the CADM2 gene have been implicated in various traits, including substance use traits (Pasman et al. 2018; Liu et al. 2019) and risk-taking behavior (Strawbridge et al. 2018; Arends et al. 2021), but also in traits associated with personality (Boutwell et al. 2017), cognition and educational attainment (Ibrahim-Verbaas et al. 2016; Lee et al. 2018), reproductive success (Day et al. 2016), autism spectrum disorders (Casey et al. 2012), physical activity (Klimentidis et al. 2018), BMI/obesity (Locke et al. 2015; Morris et al. 2019), and metabolic traits (Morris et al. 2019).

CADM2 encodes a member of the synaptic cell adhesion molecules (SynCAMs) involved in synaptic organization and signalling, suggesting that alterations in CADM2 expression affect neuronal connectivity. CADM2 is expressed more abundantly in brain tissue than in other tissues, particularly in areas important for reward processing and addiction, including the frontal anterior cingulate cortex (Ibrahim-Verbaas et al. 2016), substantia nigra, and insula (Ndiaye et al. 2019). Accordingly, CADM2 is a gene that warrants further exploration.

In this study we performed a phenome-wide association analysis (PheWAS), in which we tested for associations of CADM2 (at the SNP and gene level) with a comprehensive selection of psycho-behavioral phenotypes as measured in the UK Biobank cohort. The results provide insight into whether the role of CADM2 is confined to a specific set of traits or extends to a wider range of phenotypes. This will inform future studies on the function of CADM2 and the neurobiological underpinnings of different psycho-behavioral traits. An additional advantage of testing a single gene is that the multiple testing burden is reduced compared to genome-wide studies, resulting in higher statistical power.

UK Biobank is a nationwide study in the United Kingdom containing phenotypic and genetic information for up to 500,000 individuals (Bycroft et al. 2018). We analyzed data from 12,211 to 453,349 UK Biobank participants with European ancestry for whom genetic and phenotypic data were available. About half (54.3%) of the sample was female, and mean age was 56.8 years (SD = 8.0, range 39–73). We extracted the CADM2 region 250 kb up- and downstream (all HRC best-guess imputed SNPs from bp 84,758,133 to 86,373,579 on 3p12.1, GRCh37/hg19) and selected 4,265 SNPs with missingness rates < 5%, minor allele frequency > 1%, and p-value for violation of Hardy–Weinberg equilibrium above \(10^{-6}\) (quality control details are described in Abdellaoui (2020), with the only difference that we included all HRC imputed SNPs, whereas Abdellaoui (2020) only included HapMap3 SNPs).
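The SNP selection criteria above can be sketched as a simple filter. This is only an illustration under stated assumptions, not the pipeline used in the study: the `SnpStats` record, its field names, and the example SNP identifiers are hypothetical, and the region bounds and cut-offs are taken from the text.

```python
from dataclasses import dataclass

# Region and cut-offs as stated in the text (GRCh37/hg19 coordinates).
REGION_START, REGION_END = 84_758_133, 86_373_579
MAX_MISSINGNESS = 0.05
MIN_MAF = 0.01
MIN_HWE_P = 1e-6


@dataclass
class SnpStats:
    """Hypothetical per-SNP summary record; field names are assumptions."""
    rsid: str
    position: int       # base-pair position on chromosome 3
    missingness: float  # fraction of missing genotype calls
    maf: float          # minor allele frequency
    hwe_p: float        # p-value of the Hardy-Weinberg equilibrium test


def passes_qc(snp: SnpStats) -> bool:
    """True if the SNP lies in the extracted region and meets all QC criteria."""
    return (
        REGION_START <= snp.position <= REGION_END
        and snp.missingness < MAX_MISSINGNESS
        and snp.maf > MIN_MAF
        and snp.hwe_p > MIN_HWE_P
    )


# Made-up example records, for illustration only.
example_snps = [
    SnpStats("snp_a", 85_530_000, 0.01, 0.25, 0.40),   # passes
    SnpStats("snp_b", 85_620_000, 0.08, 0.25, 0.40),   # too much missingness
    SnpStats("snp_c", 85_000_000, 0.01, 0.005, 0.40),  # too rare
    SnpStats("snp_d", 90_000_000, 0.01, 0.25, 0.40),   # outside the region
]
kept = [snp.rsid for snp in example_snps if passes_qc(snp)]
```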

We selected 242 psychological and behavioral phenotypes, representing 12 categories, with a sample size above N = 10,000 (for binary traits we used effective sample size \(N_{\text{eff}} = \frac{4}{1/N_{\text{cases}} + 1/N_{\text{controls}}}\)). To maximize sample size, we used the first available measurement for each individual; if the first instance was not available, we took the second, otherwise the third, etc. In addition, we included eight traits that were derived for recent genetic studies, including seven substance use traits and educational attainment in years (for an overview of all included traits, see Table S1). Continuous phenotypes were cleaned such that theoretically implausible values were set to missing and extreme values of more than 4 SDs from the mean were winsorized at 4 SDs from the mean. Binary and ordinal variables were left unchanged. Ordinal variables were analyzed as continuous variables.
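The phenotype preparation steps described above can be illustrated with a short Python sketch. It is a minimal illustration rather than the code used in the study; the function names are hypothetical, and the sketch assumes that the mean and SD used for winsorizing are computed on the uncleaned values.

```python
import statistics


def effective_n(n_cases: int, n_controls: int) -> float:
    """Effective sample size for a binary trait: 4 / (1/N_cases + 1/N_controls)."""
    return 4.0 / (1.0 / n_cases + 1.0 / n_controls)


def first_available(instances):
    """Return the first non-missing measurement across assessment instances."""
    for value in instances:
        if value is not None:
            return value
    return None


def winsorize(values, n_sd: float = 4.0):
    """Clip values lying more than n_sd SDs from the mean to the mean +/- n_sd SDs."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    return [min(max(v, lower), upper) for v in values]
```

For a balanced binary trait the effective sample size equals the total sample size, for example `effective_n(5000, 5000)` gives 10,000, whereas unbalanced case-control ratios yield a smaller value.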

The SNP-based association analyses were performed in fastGWA (Jiang et al. 2019), taking into account genetic relatedness. Analyses were controlled for effects of age, sex, and 25 genetic principal components [PCs, to control for genetic ancestry (Abdellaoui et al. 2019)]. We used linear mixed modeling for all traits and Haseman-Elston regression to estimate the genetic variance component. To test the significance of CADM2 associations at the gene level, we conducted a MAGMA gene-based test (de Leeuw et al. 2015), which aggregates the SNP effects (regardless of direction) in a single test of association. We used the default SNP-wise mean procedure (averaging SNP effects across the gene) and checked the results of the SNP-wise top procedure for comparison (this procedure is more sensitive when only a small proportion of SNPs has an effect). As the significance threshold for the SNP-based test we adopted a genome-wide significance threshold of p < 5E−08. As this is rather stringent given that we test within a single gene, we also used a significance threshold of 0.05 corrected for the number of independent SNPs (n = 133, at R2 = 0.10 and 250 kb) and the number of traits, resulting in 0.05/(133*242) = 1.55E−06. For the gene-based test we used a threshold of 2.62E−06, corresponding to 0.05 divided by the total number of genes included in the test (19,082). To provide an estimate of the effect size of the top SNP for each trait, we used \(R^{2} = \frac{{2\beta^{2} MAF\left( {1 - MAF} \right)}}{{2\beta^{2} MAF\left( {1 - MAF} \right) + \left( {se\left( \beta \right)} \right)^{2} 2N MAF\left( {1 - MAF} \right)}}\), as described in Shim et al. (2015), with adaptations for binary traits as described in Pasman et al. (2018).
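The threshold arithmetic and the effect-size formula above can be checked with a few lines of Python. This is a sketch for illustration only: the function names are hypothetical, and the adaptation for binary traits is not included.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Significance threshold after Bonferroni correction for n_tests tests."""
    return alpha / n_tests


def variance_explained(beta: float, se: float, maf: float, n: int) -> float:
    """R^2 of a single SNP from its effect size, standard error, MAF and sample size."""
    genetic = 2.0 * beta**2 * maf * (1.0 - maf)
    residual = se**2 * 2.0 * n * maf * (1.0 - maf)
    return genetic / (genetic + residual)


# 133 independent SNPs x 242 traits for the SNP-based test.
snp_threshold = bonferroni_threshold(0.05, 133 * 242)
# 19,082 genes for the gene-based test.
gene_threshold = bonferroni_threshold(0.05, 19_082)

print(f"{snp_threshold:.2e}")   # 1.55e-06
print(f"{gene_threshold:.2e}")  # 2.62e-06
```

Because the factor 2·MAF·(1 − MAF) appears in both terms of the denominator, the expression as written reduces algebraically to β² / (β² + N·se(β)²), so the result depends only on the test statistic β/se(β) and the sample size.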

At the SNP level, 37 traits (out of 242) showed significant associations at the genome-wide threshold of p < 5E−08, and 58 traits at the more lenient threshold of p < 1.55E−06 (Fig. 1a, Table 1). In the gene-based test, 50 traits showed significant associations (Fig. 1b, Table 1). Thirteen of the 60 substance use traits showed a significant association with CADM2. Furthermore, strong associations were found for cognitive ability, risk taking, diet, BMI, daytime sleeping, sedentary behaviors, nervousness-like traits, and reproductive traits. There were fewer associations with occupational, traumatic experiences, social connection, and non-worry related depression traits. Full SNP and gene-based results are provided in Tables S2 and S3a and Figs. S1a and S1b. Table S3b shows the gene-based results for the SNP-wise top procedure. These differed somewhat from the SNP-wise mean results, with only 34 significant associations and a correlation of r = 0.64 between the p-values from the respective tests.

Fig. 1

PheWAS results. Panel A shows the subset of significant associations of the SNP-based test (58 out of 242 traits). The x-axis shows the traits (colored by trait category) and the y-axis the p-values of the association. Each dot represents a SNP association. SNPs exceeding the red horizontal line have a p-value significant at a genome-wide threshold of p = 5E−08. The blue horizontal line represents the suggestive threshold of p = 1.55E−06. Full SNP-based results are given in Supplementary Fig. 1. Panel B shows the subset of significant results of the MAGMA gene-based test (50 out of 242 traits), with p-values on the y-axis. The red dotted line represents a threshold of p = 2.62E−06. The full gene-based results are depicted in Supplementary Fig. S2

Table 1 Phenotypes with a significant association with CADM2 according to the MAGMA gene-based test (SNP-wise mean) at p < 2.62E−06

In the main PheWAS analysis, we controlled for potential bias in estimated associations due to population stratification using 25 genetic PCs. However, CADM2 is located in a long-range linkage disequilibrium (LD) region, which may make it infeasible to adequately control for population structure with PCs. Also, genetic association analyses may pick up signal that is due to social stratification, which will not be accounted for by these 25 PCs. We therefore performed a sensitivity analysis in which we, in addition to the 25 PCs, controlled for the participants’ region of birth and region of current address (see Supplementary methods). Controlling for these geographical covariates attenuated the association results: of the 50 significant trait associations at the gene level, 26 were no longer significant, and on average the betas of the top-SNPs within these genes were attenuated by 16% (Table S3c, Fig. S1c). These findings imply that (social) stratification introduces regional-level gene-environment correlations that affect the genetic association results (Abdellaoui 2020), although the lower number of significant gene associations could in part be the result of reduced power due to the inclusion of hundreds of dummy covariates coding geographical region. Even after controlling for effects of stratification/gene-environment correlation, there remained evidence of widespread associations with CADM2.

We assessed whether the high number of associations discovered for CADM2 was unusual or similar to that found for other genes. To this end, we selected a random set of 50 genes of comparable size (at most 50% smaller or larger than CADM2), repeated the SNP-based analysis for these genes, and compared the number of traits with significant associations. Most of the random comparison genes contained fewer than 5 SNP-trait associations, with an average of 2.6 associated traits per gene and a maximum of 13 (as compared to 50 for CADM2; Table S4). We additionally made a comparison with five large genes from regions with a similar level of LD as the CADM2 region (five was the number of similarly sized genes that were within LD regions defined in Price et al. (2008)). The number of significant associations within these genes was still substantially lower than in CADM2 (maximum 6, Table S5). Results from these comparison analyses show that the high number of associations discovered for CADM2 is exceptional (Fig. S2).

The CADM2 SNPs that showed the highest number of significant trait associations (with a maximum of 26 traits at p < 1.55E−06, Table S6) clustered around loci at 85.53 and 85.62 Mb. As can be seen in Fig. 2, most SNPs that were independently (LD R2 < 0.01, distance > 250 kb) significantly associated with at least one trait are located in the middle of the gene, a region rich in expression quantitative trait loci (eQTLs).

Fig. 2

The top 100 most significant SNPs for each trait with at least 1 significant SNP. The x-axis represents the base pair position, and the panel below shows information on the CADM2 transcripts as derived from https://www.ensembl.org/

To further investigate eQTL effects, we used S-PrediXcan with the 49 precalculated GTEx Elastic Net models (Barbeira et al. 2018) to test for associations between traits and CADM2 expression levels in 17 brain and non-brain tissues (see Supplementary Methods). From each trait category with significant associations (N = 9) we selected the trait with the strongest association with CADM2. For all traits we found significant associations with CADM2 expression in multiple tissues (Table S7, Fig. 3). Highly significant effects were observed for lung, mammary, and adipose tissues across all traits. CADM2 expression in brain tissues was significantly associated with many traits, including risk taking, nervous feelings, and hot drink temperature. Small to negligible effects were observed for spleen and tibial nerve tissues.

Fig. 3

S-PrediXcan results testing the association between the GWASs of selected top traits and CADM2 expression in a range of tissues. S-PrediXcan was run with elastic net models based on GTEx v8 expression data. The y-axis shows the FDR-corrected, log-transformed p-values, with the red line representing the significance threshold of pFDR = 0.05
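The caption does not state which FDR procedure was applied. As a sketch, assuming the widely used Benjamini-Hochberg method, FDR-adjusted p-values across tissues could be computed as follows; the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Made-up p-values for four tissues, for illustration only.
raw_p = [0.01, 0.04, 0.03, 0.005]
p_fdr = benjamini_hochberg(raw_p)
significant = [p < 0.05 for p in p_fdr]
```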

This PheWAS showed that CADM2 was involved in a wide spectrum of traits, thereby reproducing and extending previous findings. Interestingly, comparison with 50 other genes showed that this number of trait associations was exceptionally high, emphasizing the distinctive role of CADM2 in psycho-behavioral traits. Substance use traits did not seem highly overrepresented among the significantly associated traits, suggesting that the involvement of CADM2 is of a more general nature. Many of the associations we found have been reported in previous literature [Table S6, based on GWAS Catalog (Buniello et al. 2019)]. Others were previously calculated by Neale et al. and Watanabe et al. using PheWAS in the same dataset, but not reported in a scientific paper [see Open Targets Genetics Platform, Carvalho-Silva et al. (2019), or GWAS Atlas, Watanabe et al. (2019)]. We add to these findings by identifying trait associations that remain strong after taking into account geographical stratification (e.g., age at first sexual intercourse, nervous feelings, and risk taking), and by showing how the most strongly associated traits relate to differential CADM2 expression. The variance explained by CADM2 was highest for number of children fathered, age at first sexual intercourse, and hot drink temperature. Overall, effect sizes were small (less than 0.04% for number of children), in line with what is normally found for single variants. Few associations were found in the social interaction, sleep, traumatic experiences, and occupational categories. Also, few mental health traits showed an association (8 out of 52 traits). It is interesting to note the significant associations with worry and nervousness-like traits in the absence of association with other depression- and anxiety-related traits. There may be something specific to these seemingly overlapping traits, translating to distinct biological pathways.

It should be noted that sample sizes for the phenotypes differed substantially (from N = 12,211 to 453,349), and as such, it is possible that the pattern of associations was driven in part by differences in power. The correlation between sample size and p-value of the gene-based test was moderate and significant, r = −0.38 (p = 1.42E−09), showing that well-powered traits were more likely to result in a significant association. High power was clearly a requirement: the effect sizes of CADM2 were very small, as is expected for single genes and complex traits. Also, our tests were limited to the psycho-behavioral traits measured in the UK Biobank; inclusion of more measures, such as longitudinal or non-self-report measures, could contribute to a more complete picture. Still, the range of tested traits was quite broad and enabled us to discern interesting patterns.

More research is needed to elucidate these links between CADM2 and this spectrum of psycho-behavioral traits in terms of neurobiological mechanisms. For example, it could be that CADM2 is important for the learning aspects of behavior, given its role in synaptic connectivity. Speculatively, CADM2 could then contribute to reward-learning and associative learning, giving rise to risky behavior, substance use, and other kinds of behaviors that involve such processes (Volkow et al. 2016).

This study presents a comprehensive and rigorous test of associations between CADM2 and psycho-behavioral traits, showing strong associations for a wide range of traits. Results could be used as a starting point for future research into the function of CADM2. Research on the trait associations and function of CADM2 will further our understanding of the biology of behavior.