Introduction

Transposable elements (TEs) are DNA sequences that have the ability to move around in the genome. TEs constitute 53–60% of the human DNA [1, 2] and are essential elements in driving genome evolution [3]. Among non-LTR retrotransposons (Alu, LINE, and SVA), only LINE-1 (L1) can actively transpose, while Alu and SVA rely on L1’s machinery to mobilize themselves [4]. While the vast majority of TEs are no longer transpositionally active, they can still play a functional role as exapted enhancers or transcriptional start sites [5,6,7,8], by inserting transcription factor binding sites (TFBS) [9, 10] or by acting as novel RNA genes such as long non-coding RNAs (lnc-RNAs) [11]. Therefore, TEs participate in regulating the expression of nearby genes, at transcriptional and post-transcriptional levels, providing a crucial role as both cis- and trans-regulatory RNA sequences [12]. Baillie and colleagues [13] also found that protein-coding loci are disproportionately affected by TEs, with over-representation of L1s in introns and Alus in exons. Overall, TEs seem to predominantly affect neurogenesis and synaptic function, with studies suggesting a putative regulatory role of TEs in the neural genome [14,15,16,17,18]. Nonetheless, only initial research work has been systematically performed on this issue and the TE-controlled regulatory architecture of the human genome still needs to be better explored and investigated. According to recent studies, TEs’ insertion polymorphisms “can be mapped as cis-expression quantitative trait loci with substantial effects on gene expression, especially at loci involved in immune response and cognitive function” [19, 20]. A polymorphic TE insertion can be exapted as a functional element depending on its site of insertion within the genome, or it can disrupt an already existing enhancer [21]. Additionally, polymorphic TEs have been shown to be closely associated with complex phenotypes in GWAS investigations suggesting that polymorphic non-reference TEs (nrTEs) may contribute to disease phenotypes through cis-regulatory effects [22,23,24]. Interestingly, nrTEs are relatively young compared to fixed TEs, therefore they are likely to have had a role in the most recent phases of the evolution of our species, which particularly involved the brain and superior cognitive abilities.

In recent years, mounting data provided evidence that epigenetic mechanisms and TEs are playing a key role in schizophrenia and other neurological disorders [17, 25,26,27]. For example, Bundo and colleagues [15] described an increased number of somatic L1 retrotransposition in the dorso-lateral prefrontal cortex (DLPFC, Brodmann’s area 46) of people affected by schizophrenia, observing that the total number of brain-specific L1 insertions tended to be higher in schizophrenia patients, an observation confirmed by Doyle et al. [18]. In our own previous work [17] and in a recent review [28], L1 insertion sites were also reported to be preferentially localized to synapse- and schizophrenia-related genes. Guffanti et al. [25] developed a method to quantify the tissue-specific expression of TEs (as well as other ncRNAs), and found more than 650,000 expressed TEs in the DLPFC of post-mortem human brains: about 114,000 TEs are differentially expressed between schizophrenia cases and healthy controls and mostly represented by primate- or human-specific elements.

A recent study suggests another potential key role for TEs in rewiring the local functional architecture of human accelerated regions (HARs) in Schizophrenia and bipolar disorder [29]. Indeed, HARs have been implicated in neurodevelopmental and neuropsychiatric disorders [30,31,32], and most HARs are known to act as developmental enhancers that are involved in controlling and regulating human cognition [33,34,35,36,37].

In this study, our goal was to identify polymorphic TEs that can potentially contribute to schizophrenia. We choose to look at the non-reference TE content of DLFPC genomes of schizophrenic individuals (SCZ), to investigate the brain tissue-specific presence of nrTEs and possibly disentangle their somatic or germ-line origin. To accomplish this task, we will (1) compare SCZ with control (CTRL) genomes; (2) check for the presence of nrTEs and the population-specific/geographic distribution of the identified variants in the 1000 Genome data; (3) perform haplotype-based association tests; (4) explore the possible functional roles of nrTEs as cis-regulatory elements of protein-coding genes and as putative modifiers of known HARs in silico.

Subjects and methods

DNA from the DLPFC of 10 schizophrenic patients and 10 psychiatrically healthy controls has been obtained from the UCI Brain Bank, following a UCI/IRB-approved protocol. After DNA extraction from brain tissue samples and QC controls, we outsourced the whole genome sequencing to Illumina (https://www.illumina.com/services/sequencing-services.html), which then returned the assembled genomes and the fastq raw reads. Using the fastq files provided, we then realigned the raw reads to the human reference genome hs37d5 (http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/) with BWA-mem [38]. After sorting and merging with Samtools [39], we applied the GATK best practices to generate VCF files that include single nucleotide polymorphisms (SNPs) as well as insertions/deletions (Indels) (https://gatk.broadinstitute.org/hc/en-us/articles/360035894711-About-the-GATK-Best-Practices).

We searched for non-reference TEs (nrTEs: Alu, LINE1, and SVA) with the mobile element locator tool (MELT) v.2.1.5 [40], using MELT-Split with default parameters on our 20 high-coverage genomes (Supplementary Table 1). To analyze the possible presence of nrTEs—which will suggest their germline rather than somatic origin—and the geographic variability of the putative schizophrenia-related nrTEs, we additionally selected 125 samples from the “1000 Genomes Project” phase 3 [41]. These 125 samples were analyzed with MELT jointly with SCZ and CTRL samples. Only “PASS” sites were included in a single final VCF file and only nrTEs mapping in genic or regulatory regions (introns, exons, promoters, terminators, and UnTranslated Regions, UTRs) on autosomal chromosomes were considered for further analyses. Fisher tests of independence were performed to identify which nrTEs show significantly different frequencies in SCZ and CTRL. Tests were performed with one and two degrees of freedom, respectively, for allelic and genotype frequencies. nrTEs that yielded nominally significant tests (pval < 0.05) at least for allele and/or genotype frequencies were considered as putatively associated with schizophrenia.

To assess the genetic relationships among the individuals included in our dataset, as well as their ancestry, we implemented a principal component analysis (PCA) and an ADMIXTURE analysis [42], both on the whole variant dataset (single nucleotide polymorphisms, SNPs, and nrTEs) and on the nrTE-based only. Quality control (QC) was performed with the PLINK software [43].

We also performed a haplotype reconstruction procedure with SHAPEIT v.1.9 [44] on the whole variant dataset to contextualize the polymorphic inserted nrTEs into their local genetic environment and evaluate the frequency of the corresponding haplotypes within the DLPFC cohort. We then performed a haplotype association test using Beagle v.3.3.2 [45] on the nrTEs with significantly different allelic and/or genotype frequencies between SCZ and CTRL.

As highlighted in the “Introduction” section, TEs can act as cis-regulatory elements by, for example, modifying the expression of nearby genes and/or inducing alternative splicing. Therefore, we checked if non-reference TEs may act as eQTLs and/or sQTLs, by comparing our significant results with those from Cao et al. [46], based on the GTEx dataset [47].

Moreover, we verified whether the statistically significant non-reference TEs (i.e., nrTEs with significantly different allele/genotype frequencies between cases and controls) are located close to genes previously studied in the context of schizophrenia.

We also compared our 38 nrTEs with the lists of known HARs as originally discovered by Pollard et al. [48, 49], Prabhakar et al. [50], Bird et al. [51], Capra et al. [52], and Gittelman et al. [53], as well as against the HAR genes proposed by Wei et al. [54] to check if some of the nrTEs we identified are located in those regions.

Results

non-reference retrotransposon insertions

We identified 7952 nrTEs in genic/regulatory regions: 6542 Alu (82.3%), 1065 LINE-1 (13.4%), and 345 SVA (4.3%), as shown in Table 1, using MELT-Split on our 145 samples (10 SCZ, 10 CTRL and 125 normal individuals from 1000 Genomes Project, 1KGP).

Table 1 Location of the identified non-reference TEs as defined by the MELT output.

We checked the chromosomal distribution of the 7952 nrTEs and found no significant difference (Fisher pval > 0.63) between the expected and observed content of the different families of TEs (SVA, LINE-1, and Alu) (Supplementary Figs. 13).

Population structure of the dataset

To contextualize our 20 DLPFC samples in the worldwide genomic landscape, we performed PCA and ADMIXTURE analyses on nrTEs genotypes and found that the best estimate for K in the latter is 3, with CV error = 0.37350, including 125 1KGP samples from five populations. Our results (Figs. 1 and 2) show that 7338 nrTEs (93%) that we identified in our 20 DLPFC samples are also present in the 125 samples from 1000K Genomes, suggesting their germ-line rather than somatic origin, and are also useful predictors of the genomic structure of the different human populations, as confirmed by the intermediate position of Indians (ITU) between Europeans (CEU) and Chinese (CHB), as well as by the clear differentiation between Eurasian and African samples. The remaining 7.3% of the nrTEs we detected in the DLPFC DNAs of our sample (n = 534: 35 SVAs (10% of total non-reference SVAs and 1.75/subject), 56 LINE1s (5% and 2.8/subject) and 473 Alus (7% and 23.6/subject)) are unique and not shared across other samples nor are they listed in known reference databases, like euL1db [55] and gnomAD [56]. They may be regarded as somatic retrotranspositions or they may still be germline nrTEs with low frequencies since we cannot distinguish between the two possible origins. Even in case they are somatic rather than germline events, they still represent a minority of our observed nrTEs.

Fig. 1: Principal component analysis (PCA) of the DLPFC and 1KGP samples based only on non-reference TEs.
figure 1

Pink: Europeans (CEU); green: Han Chinese in Beijing (CHB); brown: controls from the DLPFC (CTRL); red: Indian Telugus (ITU); yellow: Luhya in Kenya (LWK); blue: schizophrenic individuals from the DLPFC (SCZ); violet: Yoruba in Nigeria (YRI).

Fig. 2: Admixture plot of the 20 DLPFC + 125 1KGP samples.
figure 2

ADMIXTURE plot based only on nrTEs. K = 3 is shown (CV error = 0.37350).

The ADMIXTURE plot of nrTEs shows that CTRL and SCZ share the same ancestral components of CEU, (violet). Han Chinese (orange) have their own ancestral component, as well as the two African populations. ITUs show a mixture of European and Asian components. These results, as a whole, are coherent with those obtained using SNPs [41, 57].

Accordingly, nrTEs show systematic differences in allele frequencies across populations: 3131 of 7952 non-reference TEs, 2711 Alu (41.4%), 332 LINE-1 (31.1%) and 88 SVA (25.5%), have a significant geographic stratification (Fisher pval < 0.05) (Supplementary Table 2), with 2263 (28%) presenting with an allele frequency > 5%. Among these, 1501 nrTEs are found only in African populations, 833 are exclusive of non-African populations (Europeans, Indian Telugus, and Chinese) and 955 are common to all five groups (see Supplementary Table 2 and Supplementary Fig. 6). Both methods (PCA and ADMIXTURE) highlight that DLPFC samples overlap with CEU, with the partial exception of a single SCZ sample, which presents signs of admixture with a Sub-Saharan African source (as represented by YRI and LWK).

These findings suggest that SCZ and CTRL are genetically homogeneous; consequently, variants associated with the disease condition do not depend on underlying population structure.

Comparison between SCZ and CTRL subjects

We then compared the distribution of allele and genotype insertion frequencies of nrTEs (herein, ‘counts’) between SCZ cases and normal CTRLs. We detected 38 non-reference TEs with significantly different allele/genotype counts between cases and controls: three LINE1s, three SVAs, and 32 Alus, that yielded significant Fisher tests for allele and/or genotype counts [58, 59] at the nominal p-value ≤ 0.05 (Table 2). Given the limited size of our sample, no correction for multiple testing was performed. All significant nrTEs belong to evolutionarily recent elements (L1Ta and AluY), with two exceptions: the L1 on chr12:126802943 (undetermined subfamily) and the Alu on chr7:141748320 (which belongs to the subfamily Sz, older than Y).

Table 2 List of 38 significant nrTEs that are different between SCZ and CTRL: 3 SVAs, 3 L1s, and 32 Alus.

Of these 38 nrTEs, 11 show a significant difference in allele counts only, 14 in genotype counts only, and 13 in both allele and genotype counts. Interestingly, most of these TEs also show evidence of differential segregation (Fisher test, pval < 0.05) in human populations (27 for allele counts, 24 for genotype counts, 23 for both) and are more common in European and Asian populations.

The most significant allele-wise results (pval < 0.01) among the 38 significant nrTEs findings include three Alus and one SVA, whose insertion can be found on: chr4:17150918 and chr4:23511024 (both AluY and only observed in SCZ), chr11:40727097 (AluYa5, only observed in CTRL) and chr20:5268423 (SVA, more frequent in SCZ). The three Alus show a statistically significant geographical distribution (Fig. 3 and Supplementary Table 3), presenting variable insertion frequencies across populations, while the SVA on chr20:5268423 has similar allele frequencies in all the considered populations.

Fig. 3: Geographic distribution of the most significant (allele-wise) nrTEs.
figure 3

Allele frequencies of Alus on chr4:17150918 (A), chr4:23511024 (B) and chr11:40727097 (C), compared to SVA on chromosome chr20:5268423 (D). Darker colors indicate the presence of the TE (+), while lighter colors indicate the absence (−). Following populations are displayed: Europeans (blue), Indian Telugus (red), Chinese (green) and Africans (yellow), represented by Luhya in Kenya and Yoruba in Nigeria. Allele frequencies for schizophrenic individuals and healthy controls are shown in violet and pink, respectively.

Haplotype-based association analysis

To better elucidate the potential association of the identified nrTEs with schizophrenia, we then performed a haplotype-based analysis using Beagle on the previously detected 38 variants (Table 2) and obtained two significant results. The first one was for a 188 bp haplotype that includes an AluYb on chr5:100497396 in the promoter of the ST8SIA4 gene. This haplotype is characterized by 4 polymorphisms: T + TG (where “+” points out to the presence of the nrTE and the other letters represent single nucleotide variants). It is present with 15 copies in CTRL and 2 in SCZ, suggesting a strong association (pval = 6.86 × 10−5) between the presence of the haplotype and the absence of the disease.

The second significant result was for a 1 172 bp haplotype that includes the locus of an AluYb7 on chr4:23511024 in the promoter of MIR548AJ2. Interestingly, this haplotype is characterized by the absence of the insertion, with polymorphisms GC-TTI (where “I” stands for InDel and “−” indicates the absence of the nrTE) and was found with 19 copies in CTRL and 6 copies in SCZ (pval = 3.93 × 10−5): the Alu is completely absent in CTRL samples (Table 2) and present in 7 SCZ, only in a heterozygous condition.

In-silico functional inferences for non-reference TEs

At least seven genes putatively mapped by our significant non-reference TEs have been already associated with schizophrenia: LRRC4C [60], LRRC7 [61,62,63], ST8SIA4 [64,65,66], MGAM [67], ADAMTS1 [68], MIR548AJ2 [69] and SCN5A, which is also linked to the Brugada syndrome [70,71,72].

We then compared our set of 38 significant nrTEs with the eQTLs and sQTLs TEs lists produced by Cao et al. [46] using the GTEx dataset [47] (Supplementary Table 3). Indeed, 27 TEs (3 LINE-1 and 24 Alu) (68.4%) were detected as potential eQTLs acting in different tissues, 7 of which are expressed in the brain. As for sQTLs, 13 (34.21%) of our TEs (2 LINE-1 and 11 Alu) were detected as potentially contributing to alternative splicing in different tissues, 2 of them supposedly acting in the brain (chr11:76990585 and chr2:36476695). All sQTLs TEs were also eQTLs, with the single exception of an Alu (chr5:159122155, in the promoter of ADRA1B) which works only as sQTL.

Last, we compared our set of 38 nrTEs with the lists of HARs produced by Pollard et al. [48, 49], Prabhakar et al. [50], Bird et al. [51], Capra et al. [52] and Gittelman et al. [53], as well as against the HAR genes proposed by Wei et al. [54] and found that 12 nrTEs are located within as many HAR-genes (ADAMTS1, ANKRD55, CRIM1, EDIL3, LRRC4C, LRRC7, MAF, NAV2, QDPR, TENM3, TSPAN11, XKR4), which collectively show enrichment for “regulation of neuron projection development”.

Discussion

Far from being “junk”, it has been shown that transposable elements, such as non-Long Terminal Repeats retrotransposons, can contribute to human genomic diversity in various ways. Mounting data suggest both a positive and a detrimental role of retrotransposons in shaping human cognitive traits [17] and in the development of brain and central nervous system (CNS) structures [73, 74]. Several authors also suggest that retrotransposons have an important role in neurological and psychiatric disorders, such as schizophrenia [15, 16, 18, 25, 26].

However, the full impact of TEs on the human genome is still unclear, both for technological/methodological limitations as well as our current lack of knowledge of their precise effects and interactions with other genetic/epigenetic elements. Since a relationship between cognitive disorders and reference TEs has been the subject of several recent studies [15, 25, 26] with this work, we aimed to provide the first investigation of non-reference TEs as risk factors that can potentially contribute to increasing the risk of developing schizophrenia.

In the first step, to evaluate whether nrTEs distribution can contribute to population substructure (=nrTEs frequency differences due to specific populations' origin/evolutionary trajectory) and lead to specific population patterns, we inspected the genetic distribution of traditional DNA variants (SNPs, Indels, CNVs) and of nrTEs in our 20 DLPFC subjects together with 125 worldwide individuals that we collected from the 1KGP. Our PCA and Admixture results show that nrTEs are present with mostly population-specific frequencies within our worldwide dataset, similarly to the well-known patterns previously detected in SNP-based studies and we ultimately confirm that nrTEs too contribute to the higher genomic diversity of African compared to non-African populations [41, 57, 75].

Our results (Figs. 1 and 2) also show that our SCZ and CTRL individuals fall within the European genomic variability (represented by CEU in both PCA and Admixture analyses) and share a predominant European ancestral component, except for a single individual showing signs of admixture with a Sub-Saharan African source. The non-reference, polymorphic TEs we have identified in our DLPFC samples are mostly present in more than one subject, therefore they are probably due to germline retrotranspositions. Only a limited number of events (7.3%) may be considered either somatic or low-frequency germline retrotranspositions because they are not shared across samples and are present in only one sample, thus these single events should be better considered private insertions (or singletons). The total number of singletons and their proportion per subject are concordant with the estimates provided by Watkins et al. [76] for the Caucasian/European population.

Therefore, our results confirm both that polymorphic nrTEs can be used as reliable markers for reconstructing the genomic structure (and potentially the history) of samples/populations that we analyzed, as previously suggested by others [40, 76, 77] and that our SCZ and CTRL sample presents a clear nrTEs genetic homogeneity, which allows to exclude spurious associations due to hidden population substructure.

Admittedly, our results are based on a relatively low number of subjects, and they must be considered only as a preliminary discovery investigation. However, we performed a careful evaluation of population structure (PCA and Admixture analyses) and observed that our SCZ and CTRL samples are genomically homogeneous, in addition overlapping with reference samples of European ancestry (as represented by CEU from 1000 Genomes Project). Therefore, spurious associations due to population substructure can be excluded. Then, in order to strengthen the potential association between genomic markers (nrTEs) and the investigated trait (schizophrenia), in addition to performing standard allele frequency-based association methods, we also applied haplotype-based analyses, which, as expected, yielded [78] highly significant results with a magnitude pval ≤ 10−5.

As outlined above, we first identified 38 nrTEs whose frequencies are significantly different between Schizophrenics and Controls. Even considering the low sample size, we observe that allele frequency differences for these TEs are similar or even higher than those observed between the most different ‘control’ populations for the same insertions, making it highly improbable that the observed differences emerged by chance. These 38 nrTEs constitute our set of putative candidates associated with schizophrenia. Focusing only on the most significant results (pval < 0.01), they refer to three Alus and one SVA, which respectively fall on chr4:17150918, chr4:23511024, chr11:40727097 and chr20:5268423. The first two Alus are found only in SCZ, the third only in CTRL and the SVA on chromosome 20 is more frequent in SCZ subjects. The Alu on chr11:40727097 is completely absent in SCZ (but present in 27.4% of CTRLs) (Fig. 3C) and is located in the second intron of the Leucine Rich Repeat Containing 4C (LRRC4C) gene, which is highly expressed in the frontal cortex and has been associated with a positive response to antipsychotic therapy with lurasidone in SCZ patients [60]. In our case, the absence of this Alu insertion is preferentially associated with the schizophrenic condition (see Tables 2 and 3).

Table 3 nrTEs with allele p-values < 0.01.

As highlighted in the “Introduction” section, TEs can act in cis, for example, by altering the expression of a gene or by having an impact on its alternative splicing. Therefore, we also looked at the potential role of our 38 significantly different nrTEs by comparing them with the lists produced by Cao and colleagues [46] based on the GTEx dataset [47]. We found that 27 nrTEs act as eQTLs in different tissues, with 7 showing a putative eQTL effect in the brain. For instance, the AluYb3a1 on chr9:91099740 acts as eQTL in the frontal cortex and is located in the terminator of SPIN1 (Spindlin 1). Moreover, 13 nrTEs act as sQTLs, two of them in the brain: the AluYg6 chr11:76990585 and the AluYb7 on chr2:36476695. These two Alus are located in the third intron of GDPD4 (Glycerophosphodiester Phosphodiesterase Domain-Containing Protein 4) and in the terminator of the CRIM1 gene, respectively. Interestingly, the AluYb7 on chr2:36476695 acts as sQTL in the frontal cortex, and CRIM1 encodes for the cysteine-rich neuron motor 1 protein, which is developmentally regulated and involved in CNS development and organogenesis [79], other than being part of the HAR-genes that are functionally relevant in brain networks implicated in cognition [54]. Actually, recent research suggests that TEs could change the local functional architecture of HARs in schizophrenia and bipolar disorder [29] and our present findings add a further layer of support to this hypothesis, showing that 12 of the 38 significant nrTEs fall within the ORF of genes that are enriched for a neurodevelopmental process, the “regulation of neuron projection development” (ADAMTS1, ANKRD55, CRIM1, EDIL3, LRRC4C, LRRC7, MAF, NAV2, QDPR, TENM3, TSPAN11, XKR4). It is further interesting that at least three of these genes (ADAMTS1, LRRC4C, and LRRC7) have already been associated with schizophrenia.

After inferring the haplotypic context surrounding the 38 nrTEs of interest, we performed a haplotype-based association test with Beagle, which returned two significant results: one is the AluYb on chr5:100497396, located in the promoter of the gene ST8 Alpha-N-Acetyl-Neuraminide Alpha-2,8-Sialyltransferase 4 (ST8SIA4). In particular, the haplotype-based analysis revealed that the haplotype (T + TG) with the presence of the nrTE is found with 15 copies in CTRL and 2 in SCZ; indeed, there was a strong association (pval = 6.86 × 10−5) between the presence of the nrTE and the absence of the considered trait (schizophrenia). Furthermore, this insertion was shown to act as eQTL [46], therefore suggesting a potential functional mechanism. Accordingly, we could hypothesize that the haplotype with the nrTE has a protective role against the disease. Further in vitro or in vivo experiments could elucidate this potential relationship.

The other haplotype (GC-TTI) is characterized by the absence of the AluYb7 on chr4:23511024, one of the most significant variants also from the allele frequency point of view. This nrTE is completely absent in CTRL samples and present in 7 SCZ patients, only in heterozygous conditions. Therefore, our hypothesis is that the presence of the element is putatively related to an increased risk of developing schizophrenia. Moreover, the Alu is located in the promoter of MIR548AJ2, one of the 108 genome-wide significant loci for schizophrenia reported by Ripke and colleagues [69].

In conclusion, our analysis provides the first overview of nrTEs as DNA variants that are possibly related to an increased risk of developing schizophrenia. We have identified 38 nrTEs of interest, two of them being further confirmed by highly significant haplotype-based analyses. We then highlighted that several of these elements can have a remarkable impact on the expression, alternative splicing, and functionality of nearby genes by cross-checking our results with those available in recently published studies. We also defined two haplotypes in which the presence of the nrTE is either protective against the disease or associated with the schizophrenic condition. Our results, as well as those from other papers dealing with nrTEs, are based on presence/absence of TEs, i.e. considering them as biallelic markers; future research based on long-read sequencing will also need to include TEs sequence variability. We expect that a similar framework applied to a larger cohort of subjects could confirm and possibly extend our results, and experimental validation of the identified nrTEs will elucidate their effective impact on the cognitive genome.

Having identified both reference [25] and non-reference TEs associated with an increased risk to develop schizophrenia suggests that a neurodevelopmental genetic mechanism is at play in the etiopathogenesis of this complex disorder. Under this hypothesis, and given that TEs controlling for the functional architecture of the neural genome are mostly evolutionarily recent (either human-only or primate-specific), schizophrenia can emerge as a trade-off between our ongoing cognitive evolution and possible molecular flaws of our not yet completed evolutionary process.