Regulation potential of transcribed simple repeated sequences in developing neurons

Simple repeated sequences (SRSs), defined as tandem iterations of microsatellite- to satellite-sized DNA units, occupy a substantial part of the human genome. Some of these elements are known to be transcribed in the context of repeat expansion disorders. Mounting evidence suggests that the transcription of SRSs may also contribute to normal cellular functions. Here, we used genome-wide bioinformatics approaches to systematically examine SRS transcriptional activity in cells undergoing neuronal differentiation. We identified thousands of long noncoding RNAs containing >200-nucleotide-long SRSs (SRS-lncRNAs), with hundreds of these transcripts significantly upregulated in the neural lineage. We show that SRS-lncRNAs often originate from telomere-proximal regions and that they have a strong potential to form multivalent contacts with a wide range of RNA-binding proteins. Our analyses also uncovered a cluster of neurally upregulated SRS-lncRNAs encoded in a centromere-proximal part of chromosome 9, which underwent an evolutionarily recent segmental duplication. Using a newly established in vitro system for rapid neuronal differentiation of induced pluripotent stem cells, we demonstrate that at least some of the bioinformatically predicted SRS-lncRNAs, including those encoded in the segmentally duplicated part of chromosome 9, indeed increase their expression in developing neurons to readily detectable levels. These and other lines of evidence suggest that many SRSs may be expressed in a cell type and developmental stage-specific manner, providing a valuable resource for further studies focused on the functional consequences of SRS-lncRNAs in the normal development of the human brain, as well as in the context of neurodevelopmental disorders. Supplementary Information The online version contains supplementary material available at 10.1007/s00439-023-02626-1.


Introduction
Different types of repeats occupy >50% of the ~3 billion nucleotide-long human genome, while only ~1.5% of its capacity is used to encode proteins (Lander et al. 2001;Nurk et al. 2022).The repeated sequences have been traditionally classified as interspersed and tandem repeats, based on their relative positions and the mechanisms driving their expansion.Interspersed repeats often arise through the "selfish" propagation of transposable and retrotransposable elements and their associated sequences (Goodier and Kazazian 2008;Gorbunova et al. 2021).Tandem repeats range from duplications of gene-or gene cluster-sized units to head-to-tail iterations of shorter DNA sequences.Gene duplications can enhance protein activity by increasing the protein production rate or contribute to the processes of evolutionary innovation and speciation (Kent et al. 2003;Lynch and Conery 2000;Nei and Rooney 2005).A notable example of tandem duplication of non-protein-coding genes is provided by the arrays of 47S/45S ribosomal RNA (rRNA) repeats, which are required to sustain the high levels of ribosome biogenesis in the cell (McStay 2023;Nemeth and Grummt 2018).
The biological functions of shorter tandem repeats are less well understood.Depending on the length of their repeated units, this group is often classified into microsatellites (≤9-bp units), minisatellites (10-60-bp units), and satellites (>60-bp units; Wright and Todd 2023).These boundaries may differ depending on the study and the organism, and alternative terms such as short tandem repeats and simple sequence repeats are often used as synonyms for microsatellites.To avoid confusion, we will collectively refer to all tandem repeats with microsatellite-, minisatellite-, or satellite-sized units as "simple repeated sequences" (SRSs).
Defined in this way, SRSs account for ~7.5% of the total human genome length (Nurk et al. 2022).These repeats, particularly satellites, tend to be enriched in the transcriptionally repressed heterochromatin and associated with genome maintenance functions (Altemose 2022;Ugarkovic et al. 2022).However, at least some SRSs are known to be expressed, giving rise to biologically active transcripts (Ninomiya and Hirose 2020;Trigiante et al. 2021).Since individual SRSs contain multiple repeated units, transcripts containing these sequences can form multivalent contacts with cognate proteins or nucleic acid sequences.This in turn may allow SRS-containing transcripts to act as regulators of cellular RNA metabolism or/and contribute to the assembly of large ribonucleoprotein complexes and membraneless compartments.
Many SRSs are genetically unstable and frequently increase or decrease the number of repeated units as a result of replication or/and recombination errors.The genetic expansion of microsatellites in transcribed genomic regions is known to give rise to toxic RNAs in several neuromuscular and neurodegenerative disorders (Baud et al. 2022;Ciesiolka et al. 2017;Fujino and Nagai 2022;Schwartz et al. 2021).For example, the type 1 myotonic dystrophy (DM1) is caused by the expansion of the CTG trinucleotide repeat in the 3′ untranslated region (3′UTR) of DMPK gene (Meola and Cardani 2015;Sznajder and Swanson 2019;Yum et al. 2017).The aberrant DMPK transcripts containing >50 and, occasionally, up to thousands of CUG units can form a hairpin structure stabilized by the G-C base-pairing.
Furthermore, expanded CUGs comprise numerous YGCY motifs (CUG CUG ) recognized by the muscleblindlike (MBNL) RNA-binding proteins (RBPs; Mankodi et al. 2001;Miller et al. 2000).This allows for multivalent MBNL binding to the mutant DMPK transcripts and the assembly of pathological ribonucleoprotein foci in the nucleus.The sequestration of MBNL proteins and potentially other RBPs results in dysregulation of pre-mRNA splicing and cleavage/ polyadenylation, as well as mRNA stability (Goodwin et al. 2015;Jiang et al. 2004;Masuda et al. 2012).
Pathological RNA foci and MBNL sequestration are also observed in the type 2 myotonic dystrophy (DM2), which is caused by the expansion of the CCTG repeats in intron 1 of the CNBP/ZNF9 gene (Liquori et al. 2001;Mankodi et al. 2001;Sznajder and Swanson 2019).Other repeat expansion disorders have been shown to involve the production of toxic RNAs interacting with multiple copies of different RBPs.For example, the expansion of the CAG-repeat in Huntington's disease (Nalavade et al. 2013), the GGG GCC -repeat in amyotrophic lateral sclerosis and frontotemporal dementia (Balendra and Isaacs 2018), and the CGG-repeat in Fragile X-Associated Tremor/Ataxia Syndrome (Cid-Samper et al. 2018;Sellier et al. 2010), have been shown to promote the formation of RNA foci and sequester different protein factors.
The biomedical relevance of SRSs is not limited to repeat expansion disorders.We have previously identified a UC repeat-enriched long noncoding RNA (lncRNA), PNCTR, produced by RNA polymerase I (Pol I) from an intergenic spacer separating tandem copies of 47S rRNA genes (Yap et al. 2018).We showed that PNCTR is often upregulated in transformed cells, driving the assembly of the cancerspecific perinucleolar compartment (PNC).The UC-rich sequence elements within PNCTR enable multivalent binding of the RBP PTBP1, sequestering it within the PNC.This in turn antagonizes the splicing regulation function of PTBP1 and promotes cancer cell survival (Yap et al. 2018).Using a proximity labeling approach, we have recently shown that PNCTR and the PNC are associated with a host of additional proteins involved in nucleic acid metabolism (Yap et al. 2022a, b).
SRS transcription can also contribute to normal cellular and organismal processes.A classic example is provided by the lncRNA XIST required for X chromosome inactivation (XCI) in female mammalian cells (Patrat et al. 2020).XIST contains several conserved SRS-like elements, referred to as A-, F-, B-, C-, D-, and E-repeats, which may facilitate its function through multivalent RBP recruitment (Almeida et al. 2017;Chu et al. 2015;Lu et al. 2020;McHugh et al. 2015;Pintacuda et al. 2017;Sakaguchi et al. 2016).Another example of a lncRNA expressed in healthy cells is the telomeric repeat-containing RNA TERRA.This lncRNA enriched in UUA GGG repeats plays a crucial role in telomere maintenance by recruiting a specific set of proteins and forming R-loop structures with the telomeric DNA (Arnoult et al. 2012;Deng et al. 2009;Graf et al. 2017;Mei et al. 2021;Porro et al. 2014;Silva et al. 2021).Knockdown of TERRA can lead to telomeric defects, chromosome abnormality, and cell death (Barral and Dejardin 2020;Deng et al. 2009).
Similar to telomeres, centromeric and pericentromeric DNA is enriched in SRSs.Transcripts containing centromeric satellite repeats show a considerable sequence diversity across species but play a conserved role in the assembly of the kinetochore and the centromere passenger complex (CPC; Leclerc and Kitagawa 2021;Perea-Resa and Blower 2018).One common type of centromeric repeats in human is known as alpha satellites, which are characterized by a consensus sequence unit of 171 bp in length (McNulty and Sullivan 2018).Transcripts containing alpha satellites, and possibly other centromeric SRSs, have been shown to associate with the CPC components Aurora B and INCENP, the kinetochore component CENP-C, and the centromeric histone 3 variant CENP-A (Blower 2016;Ideue et al. 2014;McNulty et al. 2017;Quenet and Dalal 2014;Wong et al. 2007).
Knockdown of alpha satellite RNA downregulates these centromeric proteins, resulting in mitotic defects and cell cycle arrest (McNulty et al. 2017;Quenet and Dalal 2014;Wong et al. 2007).Additionally, alpha satellite RNA can associate with SUV39H1, which deposits the repressive heterochromatin marks and recruits the heterochromatin protein 1α (HP1α; Johnson et al. 2017).Therefore, transcribed alpha satellites can act as scaffolds, facilitating the recruitment of proteins involved in centromeric and pericentromeric chromatin maintenance and chromosome segregation.
The above examples indicate that the normal biological functions of SRS-containing transcripts have been extensively studied in proliferating cells.On the other hand, the effects of abnormally expanded SRSs are better understood in diseases affecting differentiated cells, including neurons.The possible roles of transcribed SRSs in healthy neurons remain largely unexplored, partly due to the challenges associated with analyzing repeat-rich sequences.To address this limitation, we systematically identified SRS-containing lncRNA (SRS-lncRNA) candidates upregulated during neuronal development by mining RNA-sequencing (RNA-seq) data.We further validated our bioinformatics predictions using a newly established system for rapid neuronal differentiation of human induced pluripotent stem cells (iPSCs).Our work provides a valuable resource for further studies focused on the roles of repeat transcription in the development and function of the nervous system.

DNA constructs
The pZT-C13-L1 and pZT-C13-R1 constructs encoding the left and right TALENs specific to the CLYBL safe-harbor locus were a gift from Jizhong Zou (Addgene plasmid #62196 and #62197; Cerbini et al. 2015).The CLYBL-specific homology directed repair construct pUCM-CLYBL-hNIL was a gift from Michael Ward (Addgene plasmid #105841; Fernandopulle et al. 2018).We modified the pUCM-CLYBL-hNIL backbone to act as an acceptor locus in the recombination-mediated cassette exchange (RMCE) protocol for the high-efficiency integration of transgenes of interest (Iacovino et al. 2011).We used standard molecular cloning techniques and restriction and modification enzymes from New England Biolabs to substitute the hNIL fragment with a Lox2272-and LoxP-flanked Cre recombinase gene linked to the ΔNeoR and PuroR markers.The map of the resultant pML630 plasmid is provided Supplementary Data S1.The mouse Ngn2-encoding RMCE knock-in plasmid pML156 was generated as described previously (Zhuravskaya et al. 2023).

Generating TRE-Ngn2 iPSCs
We prepared TRE-Ngn2 iPSCs expressing an Ngn2 transgene from a Dox-inducible promoter using a twostep approach previously used for mouse ESCs (Iacovino et al. 2011).In our case, the first step involved knocking in the rtTA-2Lox-Cre cassette encoded by pML630 into the CLYBL safe-harbor locus, and the second step, high-efficiency RMCE substituting the Cre coding sequence with the pML156-encoded Ngn2.
In the first step, a ~20%-confluent wild-type iPSC culture in a 12-well plate (Corning, cat# 3513) containing 1 ml/well of Essential 8 supplemented with 10 μM ROCK inhibitor (Merck, cat# Y0503) was co-transfected with the pZT-C13-L1 and pZT-C13-R1 plasmids (Cerbini et al. 2015) and the pML630 construct mixed in the 1:1:8 ratio, respectively.We combined 2 μg of the plasmid mixture with 2 μl of Lipofectamine Stem Transfection Reagent (Thermo Fisher Scientific, cat# STEM00008) and 100 μl of Opti-MEM I (Thermo Fisher Scientific, cat# 31985070), as recommended.The resultant transfection mixture was then added drop-wise to the cells.The medium was refreshed on the next day.To select knock-in clones, 0.25 μg/ml puromycin was added to the medium 2 days post transfection and gradually increased to 0.75 μg/ml by day 12. Puromycin-resistant colonies were picked 12 days post transfection, expanded, and genotyped by PCR using the MLO3670/MLO3671 and MLO3686/MLO1631 primer pairs (Table S1).We also confirmed that the clones express Cre in a Dox-inducible manner by reverse transcription (RT) qPCR using human GAPDH as the "housekeeping" control (see Table S1 for primer sequences).
In the second step, the rtTA-2Lox-Cre knock-in cells were pre-treated overnight with 2 μg/ml doxycycline (Dox; Sigma, cat# D9891) to activate Cre expression.The cells were then transfected with a mixture containing 1 μg of pML156, 2 μl of the Lipofectamine Stem Transfection Reagent, and 100 μl of Opti-MEM I, as described above.
To select RMCE-positive clones, 25 μg/ml G418/geneticin (Sigma, cat# 10,131,019) was added to the medium 2 days post transfection and gradually increased to 60 μg/ ml by day 12. G418-resistent colonies were picked 2 weeks post-transfection, expanded and genotyped by PCR with the MLO1295/MLO1296 primers (Table S1).We also confirmed the Dox-inducible expression of the Ngn2 transgene by RT-qPCR with the same primer pair and using human GAPDH as the as the "housekeeping" control.Three TRE-Ngn2 iPSC clones selected in this manner were used for the Dox-induced differentiation experiments described below.

Immunofluorescence
Cells grown on 18-mm coverslips were washed briefly in PBS, fixed with 4% formaldehyde (Thermo Fisher Scientific, cat# 28908), washed three times with PBS, permeabilized with 0.1% Triton X-100 in PBS for 10 min, and washed three times with PBS.Fixed and permeabilized cells were then blocked in PBS containing 0.5% BSA and 0.2% Tween-20 (IF blocking buffer) for 30 min at room temperature and incubated with anti-NGN2 (Cell Signaling Technology, cat# 13144; 1:250 dilution) and/or anti-MAP2 (Biolegend, cat# 822501; 1:2000 dilution) antibodies in the IF blocking buffer at 4 °C overnight.The samples were washed once with 0.2% Tween-20 in PBS and twice with PBS, and incubated with Alexa Fluor-conjugated secondary antibodies in IF blocking buffers for 1 h at room temperature.This was followed by one wash with 0.2% Tween-20 in PBS and two washes with PBS.The coverslips were then counterstained with 0.5 μg/ml DAPI in PBS for 3 min and mounted with Pro-Long Gold antifade reagent (Thermo Fisher Scientific, cat# P36934).Images were taken using a ZEISS Axio Observer Z1 Inverted Microscope with a LD Plan-Neofluar 40x/0.6 Corr Ph 2 M27 objective.

RNA fluorescence in situ hybridization (RNA-FISH)
In some experiments, immunofluorescently stained cells were post-fixed with 4% formaldehyde for 15 min at room temperature, washed three times with PBS, and transferred to the pre-hybridization buffer containing 10% formamide (Thermo Fisher Scientific, cat# 15515026) and 2× SSC (Thermo Fisher Scientific, cat# AM9763).The samples were then hybridized at 37 °C overnight with a 125-nM mixture of RNA target-specific oligonucleotide probes (Table S1) labeled with digoxigenin (Sigma-Aldrich, cat# 03353575910), which were dissolved in the hybridization buffer containing 10% formamide, 2 × SSC, and 10% dextran sulfate (Sigma-Aldrich, cat# D8906).Following hybridization, the samples were washed for 30 min in 2× SSC and 10% formamide at 37 °C, and 15 min in 1× SSC at room temperature.Subsequently, the samples were incubated with mouse anti-digoxigenin antibody (Jackson Laboratories, cat# 200-002-156, 1:500) in 4× SSC and 0.8% BSA for 1 h at 37 °C.Following this incubation, additional washes were performed with 4× SSC, 4× SSC and 0.1% Triton X-100, and 4× SSC at room temperature for 10 min each.The samples were then incubated with Alexa Fluor-647-conjugated anti-mouse secondary antibodies (Thermo Fisher Scientific, cat# A31571, 1:300) in 4× SSC and 0.8% BSA for 1 h at room temperature, followed by washes with 4× SSC, 4× SSC and 0.1% Triton X-100, and 4× SSC at room temperature for 10 min each.Finally, the samples were counterstained with 0.5 μg/ml DAPI in PBS for 3 min and mounted with ProLong Gold antifade reagent.Images were captured using the same ZEISS system as described above, switching to an alpha Plan-Apochromat 100x/1.46Oil DIC (UV) M27 objective.

PCR-based assays
PCR genotyping was carried out using the PCRBIO HS Taq Mix Red kit (PCR Biosystems, cat# PB10.23-02), as recommended by the manufacturer.To perform RT-qPCR assays, total RNAs were extracted with the EZ-10 DNAaway RNA Mini-Preps kit (Bio Basic, cat# BS88136) as recommended, and reverse transcribed in a 20-μl format.Prior to reverse transcription, traces of genomic DNA were removed by treating 600 ng of total RNA with 1 μl RQ1 RNase-Free DNase (Promega, cat# M6101) in 9 μl reaction additionally containing RQ1 DNase buffer and 1 μl of murine RNase inhibitor (New England Biolabs, cat# M0314L) at 37 °C for 45 min.RQ1 DNase was inactivated by adding 1 μl of Stop Solution from the RQ1 kit and incubating the solution at 65 °C for 10 min.The reaction was placed on ice, supplemented with 1 μl of 100 μM random decamer primers, incubated at 70 °C for 10 min, and returned to ice.The reaction was then supplemented with 1× SuperScript IV reaction buffer, 10 mM DTT, 0.5 mM each of the four dNTPs, 1 μl murine RNase inhibitor, 1 μl of SuperScript IV Reverse Transcriptase (Thermo Fisher Scientific, cat# 18090200), and nuclease-free water (Thermo Fisher Scientific, cat# AM9939) added to the final volume of 20 μl.In RT-negative controls, reverse transcriptase was substituted with an equal volume of nuclease-free water.cDNA was synthesized by incubating the RT mixtures at 50 °C for 40 min, followed by a 10-min 70 °C heat inactivation step.The samples were finally diluted tenfold by adding 180 μl of nuclease-free water and stored at −80 °C until needed.Quantitative PCR (qPCR) was performed using qPCRBIO SyGreen Mix Lo-ROX (PCR Biosystems, cat# PB20.11-51), 100 nM of each primer and 5 μl of diluted cDNA in 20 μl reactions.Reactions were run on a LightCycler 96 Instrument (Roche).The YWHAZ gene was used as the "housekeeping" control.Primers used for PCR genotyping and RT-qPCR are listed in Table S1.
The transcript_type tag "protein_coding" from the Cuffmerge GTF file output was used to define transcripts and their corresponding genes as protein-coding.The remaining transcripts/genes were considered non-coding.To shortlist SRS-containing lncRNAs, Simple Repeat track data were downloaded using the UCSC Genome Browser table browser tool selecting the Simple Repeat track for the GRCh38/hg38 genome assembly.Only repeated sequences of >200 bp in length and comprising ≥3 repeated units were included in downstream analyses.Transcripts consisting entirely of repetitious sequences were filtered out because of the inherent difficulty in quantifying their abundance.We used the same strategy to categorize protein-coding genes as containing or lacking SRSs in the transcribed region in the analyses illustrated in Fig. 3.
To identify RBP-specific sequence motifs, we used FIMO (Bailey et al. 2015)  To process nELAVL HITS-CLIP data (Scheckel et al. 2016), sequencing reads were aligned to the hg38 genome, and the CLIP peaks were identified using the CLIP Tool Kit (CTK; Shah et al. 2017).Briefly, BAM files were first converted to the BED format and pooled: Fig. 2 Genomic distribution of SRS-lncRNAs.Box plots showing the distribution of A all genome-encoded SRSs of >200 bp in length and containing ≥3 repeated units, B SRSs in all detectably expressed lncRNAs, and C SRSs in neurally upregulated lncRNAs along the human chromosome arms separated into 10 equally sized bins, form the middle of the centromere (position 0) to the end of the telomere (position 1).The p-arms of the acrocentric chromosomes 13, 14, 15, 21 and 22 encoding the 47S/45S rRNA arrays were excluded from these analyses.D The Kolmogorov-Smirnov (KS) test shows that the distributions of all SRS-lncRNAs (blue) and neural SRS-lncR-NAs (red) along the centromere-to-telomere axis do not significantly differ on a genome-wide scale.The data are presented as empirical cumulative distribution function (ECDF) plots.Karyoplots showing the color-coded density of E all detectably expressed SRS-lncRNAs and F neurally upregulated SRS-lncRNAs, calculated for individual human chromosomes as a number of loci per 5-Mb window.The centromere-to-telomere distributions of the neural and all detectably expressed SRS-lncRNAs were compared for individual chromosome arms using the KS test.A significant difference was detected only for chromosome 9 (chr9).The panels were generated using the Rideogram package (https:// cran.r-project.org/ web/ packa ges/ RIdeo gram/ vigne ttes/ RIdeo gram.html).G A close-up of the centromere-proximal part of chr9 encoding twelve neurally upregulated SRS-lncRNAs organized as six segmentally duplicated bidirectional transcription units (arrows).The panel also displays strand-specific countsper-million (CPM) normalized coverage plots for the nuclear NPC RNA-seq data.The alpha satellite-and GAAAT repeat-containing SRS-lncRNAs, which constitute the bidirectional unit, are highlighted in magenta and teal, respectively.The coverage plot on the top represents RNAs transcribed in the forward direction, while the bottom part shows RNAs transcribed in the reverse direction The resultant half_peak_height.bedfile was used to estimate the overlap between nELAVL HITS-CLIP peaks and predicted ELAVL3 binding motifs.

Statistical analyses
Statistical analyses were carried out in R (v. 4.2.2;R Core Team 2021).Data were compared using two-sided Student's t test or Wilcoxon signed-rank test, or one-sided Wilcoxon rank sum test.Multiple comparisons were adjusted for multiple testing using the Benjamini-Hochberg (FDR) method.Alternatively, we used one-way ANOVA followed by Tukey's post hoc test or the Kruskal-Wallis test followed by Dunn's post hoc test, as indicated in the figures.SRS-lncRNA distributions along the centromeric-to-telomeric axis were compared using Kolmogorov-Smirnov test.Fisher's exact test was used to compare categorical data.P values <0.05 were considered statistically significant.

Widespread regulation of SRS-lncRNA expression during neuronal differentiation
To investigate the transcriptional status of SRSs encoded in the human genome, we examined a publicly available RNA-sequencing (RNA-seq) dataset for human embryonic stem cells (ESCs) undergoing in vitro differentiation into forebrain-specific neurons (Blair et al. 2017).We selected this study since it contains high-quality sequencing data for four differentiation stages: ESCs, neural progenitor cells (NPCs), and neurons at two stages of maturation, 14 days (Neuronal 14 ) and 50 days post-differentiation from NPCs (Neuronal 50 ).Furthermore, we reasoned that the nuclearand cytoplasmic-fraction data provided by this study should increase the likelihood of identifying compartment-enriched transcripts.
Our RNA-seq analysis pipeline was designed to handle multi-mapping reads and incorporated reference-guided transcriptome assembly.This allowed us to include both previously annotated and novel RNAs (Fig. 1A; see "Methods" for further details).We defined SRS-containing lncRNAs as transcripts that lack a known protein-coding sequence and contain > 200 nt-long simple repeat(s) (with at least 3 repeated units) sourced from the UCSC Genome Browser database.The choice of the >200-nt cutoff was based on the traditional definition of lncRNAs (Mattick et al. 2023).To mitigate the artifacts of the ambiguous alignment of multimapping reads, we excluded newly predicted transcripts composed entirely of repeated sequences.This uncovered a total of 5430 SRS-lncRNA candidates expressed in the nucleus or/and cytoplasm in at least one of the differentiation stages with the >0.1 TPM cutoff.
To identify differentially expressed SRS-lncRNAs, we compared the NPC, Neuronal 14 or Neuronal 50 samples to the corresponding ESC controls (Fig. 1A).This revealed 899 non-redundant candidates that were upregulated either in the nucleus or the cytoplasm (>twofold, FDR < 0.05) in at least one neural sample (i.e. in NPC, Neuronal 14 or Neuronal 50 ; Table S2).Among the upregulated SRS-lncR-NAs, 289 SRS-lncRNAs were upregulated in the NPC nuclei and 402 and 465 in the Neuronal 14 and Neuronal 50 nuclei, respectively.Similarly, 309 SRS-lncRNAs were upregulated in the NPC cytoplasm and 411 and 522 in the Neuronal 14 and Neuronal 50 cytoplasm (Fig. 1B, C).Our pipeline also identified 444 ESC-specific candidates that were consistently downregulated at all the neural stages in at least one compartment (>twofold, FDR < 0.05) and not significantly upregulated in the other compartment (Table S3).
The SRS-lncRNAs shortlisted in this manner included previously characterized transcripts with relevant expression patterns.For instance, LINC00632 (XLOC_334612), the precursor of the brain-enriched circular RNA CDR1as/ CiRS-7 containing multiple microRNA miR-7-interacting sequences (Barrett et al. 2017;Hansen et al. 2013;Memczak et al. 2013), was upregulated in NPCs and neurons (Fig. S1A; Table S2).Conversely, the lncRNA CPMER (XLOC_185088) involved in cardiomyocyte differentiation and expressed in embryonic stem cells (Lyu et al. 2022), as well as the p53-regulated pluripotency-specific lncRNA LNCPRESS2 (XLOC_222487; Jain et al. 2016) were downregulated in neural samples (Fig. S1B, C; Table S3).These examples provided internal controls for the performance of our pipeline.Based on the inspection of UCSC Genome Browser data, many other shortlisted transcripts were either novel (e.g., Fig. S2A, B) or matched known lncRNAs not previously associated with the neural lineage (e.g., Fig. S2C).
These data suggest that many SRS-containing transcripts change their expression during normal neuronal differentiation.
Fig. 3 Bioinformatics analyses of SRS-lncRNAs.A The density of predicted RBP motifs is significantly higher in the SRS parts of SRS-lncRNAs compared to the non-SRS parts of the same transcript.B Individual SRS-lncRNAs expressed in the neural lineage frequently contain ≥10 interaction motifs for more that one type of RBP.Ratios between transcript abundances in the cytoplasm and the nucleus for protein-coding and lncRNA genes containing or lacking SRSs calculated for C ESC, D NPC, E Neuronal 14 and F Neuronal 50 samples.All ratios are normalized by the differentiation stage-specific median of the non-SRS protein-coding group.The data are compared by the Kruskal-Wallis test (P = 0 for all panels) with Dunn's post hoc test (P values shown in the panels).The data in A-F are presented as box plots, with outliers shown in B but not the other panels ◂

SRS-lncRNAs upregulated in neural cells originate from specific genomic regions
To gain initial insights into the 899 SRS-lncRNAs upregulated in NPCs or/and neurons, we compared the positions of their loci with those of all SRSs encoded in the genome and all detectably expressed SRS-lncRNAs.Plotting the overall genomic distribution of SRSs along the centromereto-telomere axis revealed their expected enrichment in the centromere-and telomere-proximal regions (Fig. 2A).In comparison, all detectable SRS-lncRNAs lost the centromere-proximal peak and showed significant enrichment in a broader telomere-proximal region (Fig. 2B).The subset of neural SRS-lncRNAs exhibited a generally similar profile (Fig. 2C, D).Although we excluded the candidates consisting entirely of repetitious sequences from our main analyses, adding such transcripts back to the shortlist had little effect on the overall genomic distribution of neural SRS-lncRNA (Fig. S3; Table S2).
Our further inspection of individual chromosomes revealed specific enrichment of neural SRS-lncRNAs in centromere-proximal parts of chr9 comprising evolutionarily recent segmental duplications (Fig. 2E, F; Bailey et al. 2002;Crosier et al. 2002;Guy et al. 2000).Interestingly, twelve SRS-lncRNAs encoded in this region are organized into six bidirectional transcription units sharing considerable sequence similarity.Within each unit, one SRS-lncRNA typically contains a 10-36 kb-long alpha-satellite sequence, while the SRS-lncRNA transcribed in the opposite direction has a 3-8 kb-long GAAAT-rich microsatellite repeat (Fig. 2G).Of note, the chromosome-specific patterns of neural SRS-lncRNAs were clearly distinct from the distribution of neurally upregulated protein-coding genes (Fig. S4).
Thus, neural SRS-lncRNA loci are encoded in the genome in a non-random manner.

Bioinformatics characterization of neural SRS-lncRNAs
Several previously characterized SRS-containing transcripts have been shown to recruit multiple copies of specific RBPs (e.g., Yap et al. 2018).With this in mind, we conducted a sequence motif analysis and found that 695 out of the 899 neurally upregulated SRS-lncRNAs contain ≥10 putative interaction sites for individual RBPs within their SRS regions (Table S4; see "Methods" for further details).Importantly, the motif density was significantly higher in the SRS regions compared to the non-SRS parts of the same SRS-lncRNA (Fig. 3A).Furthermore, many SRS-lncRNAs were predicted to engage in multivalent interactions with more than one distinct type of RPBs, with a median of 5 different RBPs (Fig. 3B).
To benchmark our motif predictions, we turned to neuronal ELAV-like RBPs (nELAVL; ELAVL2/HuB, ELAVL3/ HuC, and ELAVL4/HuD), whose U/G-rich RNA binding sites have been experimentally identified in the human brain using the HITS-CLIP approach (Scheckel et al. 2016).Based on our predictions, 190 SRS-lncRNAs may form multivalent contacts with members of this protein family (see the ELAVL3 motif in Table S4).Notably, a large fraction of the predicted ELAVL3 motifs in SRS-lncRNAs overlapped with the nELAVL HITS-CLIP peaks, and the extent of such overlap was significantly diminished when we computationally "scrambled" the HITS-CLIP peak positions (Fig. S5).
Since many lncRNAs are retained in the nucleus (Guo et al. 2020;Palazzo and Lee 2018;Tong and Yin 2021), we compared the intracellular distribution of SRS-lncRNAs to that of other types of transcripts.As expected, the cytoplasmic-to-nuclear ratio was noticeably lower for SRS-lncRNAs than for transcripts of protein-coding genes with or without >200 nt-long SRSs.However, SRS-lncRNAs were relatively more abundant in the cytoplasm compared to lncR-NAs lacking qualifying SRSs.Interestingly, the presence of such repeats in protein-coding genes led to the opposite effect-a decrease in the cytoplasmic-to-nuclear ratio.These effects were observed for transcripts detectably expressed at all four stages of differentiation (Fig. 3C-F; >0.1 TPM in the nuclear fraction at the corresponding stage).
These data suggest that neural SRS-lncRNAs have a strong potential for multivalent RBP recruitment.Furthermore, our analyses indicate that SRSs may control the relative abundance of RNA in the nucleus and cytoplasm in a transcript-dependent manner.

Doxycycline-inducible differentiation of human iPSCs into neurons
To validate the SRS-lncRNA regulation patterns, we generated a stable human iPSC line with a doxycycline (Dox) inducible neurogenin 2 (Ngn2) transgene (Fig. S6A).Ectopic expression of Ngn2 is known to trigger efficient neuronal differentiation of proliferating stem cells in vitro (Fernandopulle et al. 2018;Lin et al. 2021;Zhang et al. 2013).Inspired by the high-efficiency RMCE-based knockin technology available for mouse ESCs (Iacovino et al. 2011), we first knocked in an "acceptor" cassette into the CLYBL safe-harbor locus of wild-type human iPSCs using the appropriate TALEN and homology-directed repair constructs (Fig. S6A).The cassette contained a puromycin selection marker, a constitutively expressed reverse tetracycline transactivator (rtTA) and a tetracycline/doxycycline responsive element (TRE) promoter driving the expression of a lox2272 and loxP site-flanked Cre recombinase gene.To enable the subsequent RMCE-dependent integration step, the cassette also encoded a fragment of the G418 resistance After selecting the rtTA-2Lox-Cre line resistant to puromycin (but not G418), we proceeded with the RMCEdependent knock-in step.The cells were treated with Dox to induce Cre expression and transfected with the pML156 "donor" plasmid containing the mouse Ngn2 along with the lox2272 and loxP sites (Fig. S6A).The pML156 design also precluded transient expression of Ngn2 that might promote unwanted iPSC differentiation.Cre-mediated RMCE was expected to integrate the Ngn2 transgene under the TRE promoter, replacing the Cre gene.Furthermore, the ΔNeoR fragment was expected to acquire a start codon and the human PGK promoter (hPGK), making the resultant TRE-Ngn2 cells G418 resistant.We confirmed Ngn2 integration by PCR-genotyping of G418-resistant clones (Fig. S6B).Moreover, TRE-Ngn2 cultures treated with Dox for 48 h expressed readily detectable amounts of the NGN2 protein and the neuronal marker MAP2 (Fig. S6C).Expression of these proteins was not detected in mock-treated cells (Fig. S6C).
To induce neuronal differentiation, we incubated the TRE-Ngn2 cells with Dox for three days and then maintained the cultures until day 14 in a medium supplemented with the neurotrophic factors BDNF and NT-3 (Fig. 4A).RT-qPCR analyses of RNA samples collected on differentiation days 0, 1, 2, 3, 7, and 14 showed that the cells progressively lost the expression of the pluripotency markers POU5F1 and NANOG (Fig. S6D-E) and gained the expression of an early neuronal marker, NCAM1, beginning from differentiation day 1 (Fig. S6F).The mature neuronal marker SYN1 and the glutamatergic marker SLC17A6/VGLUT2 were upregulated on differentiation day 7 and continued to be expressed on day 14 (Fig. S6G, H).
We concluded that Dox-induced TRE-Ngn2 cultures recapitulated the natural dynamics of gene expression in developing neurons.

The expression of neural SRS-lncRNAs is regulated in a developmental stage and cell type-specific manner
We utilized the newly established neuronal differentiation system to validate our bioinformatic predictions.TRE-Ngn2 iPSCs were differentiated as outlined in Fig. 4A and analyzed by RT-qPCR on differentiation days 0, 1, 2, 3, 7, and 14, using appropriate primers.We first investigated the clustered SRS-lncRNAs from the segmentally duplicated part of chr9 (Fig. 2G).According to the RNA-seq data, both members of the bidirectionally transcribed unit were expected to be upregulated in NPCs and retain detectable expression in neurons, albeit at a somewhat decreased level (Fig. 4B).
Our RT-qPCR analyses of the TRE-Ngn2 time series with primers annealing upstream of the GAAAT and alpha-satellite repeats confirmed the upregulation of the corresponding SRS-lncRNAs (e.g., XLOC_320267 and XLOC_312995; Fig. 2G) during neuronal differentiation (assays 1 and 2; Fig. 4B-D).Similar results were obtained when we repeated the analysis using primers designed against downstream sequences (assays 3 and 4; Fig. 4B, E, F).Of note, RT-qPCR showed a higher expression of the two types of SRS-lncR-NAs in day-14 neurons compared to the earlier time points.The apparent difference of this dynamics from Fig. 4B might be due to possible variability between the ESC and the TRE-Ngn2 iPSC differentiation protocols.
To confirm the upregulation of XLOC_312995 and XLOC_320267 in neurons, we analyzed differentiation day 0 and 14 samples by RNA-FISH (Fig. S7).Although this approach detected XLOC_312995-and XLOC_320267specific nuclear foci in both iPSCs and neurons, the foci tended to be noticeably larger in neurons compared to iPSCs (Fig. S7).
To explore the specificity of SRS-lncRNA expression, we mined Encode RNA-seq data and shortlisted SRS-lncRNAs upregulated in diverse types of differentiated human cells.The Encode list for neurons showed a robust overlap with the neurally upregulated SRS-lncRNAs selected by our bioinformatics pipeline (321 out of 899; Fig. 6A; Table S5).Other cell types, including astroglia, endothelial, and endodermal cells, also shared considerable numbers of common SRS-lncRNAs with the neurally upregulated SRS-lncRNAs (Fig. 6B, C; Table S5), but the overlaps in these cases were smaller compared to Fig. 6A.Notably, 129 out of the 321 overlapping neuronal SRS-lncRNAs were not upregulated in the other cell types (Table S5).This subset included, for instance, the XLOC_039609/KRTAP5-AS1 transcript (Fig. S2) and two out of six GAAAT repeat-containing transcripts from the chr9 SRS-lncRNA cluster (XLOC_312965 and XLOC_319986; Fig. 2G).
Thus, many neural SRS-lncRNAs predicted by our pipeline appear to be expressed in a developmental stage and cell type-specific manner, and their expression can be perturbed in disease.

Discussion
Our study argues that numerous long noncoding transcripts containing extensive stretches of simple repeated sequences are expressed in genetically normal human pluripotent stem cells and developing neurons (Fig. 1).Many of these SRS-lncRNAs originate from telomere-proximal regions, despite the abundance of SRSs in both centromere-and telomere-proximal DNA (Fig. 2).While additional work will be required to understand the molecular basis of this genome-wide trend, possible underlying reasons might include more accessible structure or/and more favorable epigenetic modifications of the telomere-proximal chromatin compared to its centromere-proximal counterpart.
A large fraction of the SRS-lncRNAs predicted by our bioinformatics pipeline is significantly upregulated during normal neuronal differentiation, and at least some of these transcripts appear to be specific to neurons (Figs. 1,4,5,6;Figs. S10,S11;Tables S2, S5).Moreover, a subset of SRS-lncRNAs appears to be deregulated in the context of ASD (Fig. S12; Table S6).Similar to other SRS-lncRNAs, many neurally upregulated members of this type of transcripts tend to be encoded in telomere-proximal regions of the genome.An important exception is a broad centromereproximal region of chr9 that gives rise to a significantly larger number of neural SRS-lncRNAs than expected by chance (Fig. 2E, F).The SRS-lncRNAs encoded in this region are organized as bidirectional pairs of the alpha satellite-and GAAAT repeat-containing transcripts expressed from a common GC-rich promoter (Figs. 2G, 4B).This part of the genome is known to be segmentally duplicated in primates (Bailey et al. 2002;Crosier et al. 2002;Guy et al. 2000), suggesting an intriguing possibility that it encodes primate-specific functions.Additional studies will be needed to test this hypothesis and explore the mechanisms underlying the upregulation of these SRS-lncRNAs in neurons.
Our RBP motif analyses suggest that the chr9-derived and other SRS-lncRNAs upregulated in NPCs and neurons can recruit multiple copies of specific RBPs via SRS-enriched cognate sequence motifs (Fig. 3A, B; Table S4).For example, the medium numbers of multivalent RBP motifs in the SRS parts of the alpha-satellite and GAAAT-repeat SRS-lncRNA families of the chr9 transcripts are 46 and 31.5, respectively (Table S4).The analysis of nELAVL RNAbinding preferences suggests that many computationally predicted motifs serve as bona fide sites for RBP recruitment (Fig. S5; Scheckel et al. 2016).Further validation of such interactions and understanding their possible role in the RNA metabolism of developing neurons will be important directions for future studies.
It will be also interesting to follow up on our finding that SRSs are associated with a decrease in the cytoplasmic-tonuclear ratio for transcripts of protein-coding genes, while having the opposite effect on lncRNAs (Fig. 3C-F).We hypothesize that this difference relates to the abundance of introns in the former group and their paucity in the latter.Indeed, at least some SRSs are known to interfere with intron excision and mRNA export from the nucleus to the cytoplasm (Monteuuis et al. 2019;Sznajder et al. 2018;Yap et al. 2012).
Although SRS-lnRNAs are still more commonly found in the nucleus compared to protein-coding transcripts (Fig. 3C-F; Fig. S7), it is possible that SRS-lnRNAs "escaping" the nucleus function in the cytoplasm.A previously characterized example of a repeat-containing lncRNA that acts in both the nucleus and the cytoplasm is NORAD (Elguindy and Mendell 2021;Lee et al. 2016;Munschauer et al. 2018;Tichon et al. 2016).This conserved transcript lacks classically defined SRSs, but contains 18 UGU RUA UA motifs that may have originated from ancient duplication events.NORAD utilizes these sequences to sequester Pumilio-family RBPs, thereby altering mRNA stability in the cytoplasm.
In conclusion, we have identified multiple instances of simple repeated sequences, which are transcribed in a developmentally regulated and cell type-specific manner.We that comprehensive analyses of the expression dynamics, cellular localization, and interaction partners of neural SRS-lncRNAs using the TRE-Ngn2 system (Fig. S6) and other experimental approaches will illuminate their functional significance in the normal development of the human brain, as well as in neurodevelopmental disorders.
Day-7 differentiating TRE-Ngn2 cultures grown in the 12-well plate format were transfected with 50 pmol/well of an XLOC_312995-specific (gmXLOC_312995; Qiagen, cat# 339511 LG00797536-DDA; 5′-G*T*A*G*T*A*G*G*T* G*G*T*C*T*T*T; the asterisks indicate phosphorothioate bonds; the positions of the LNA modifications are not provided by the manufacturer) or a negative control gapmer (gmControl; Qiagen, cat# 339516 LG00000002-DFA; 5′-A*A*C*A*C*G *T*C*T*A*T*A*C*G*C) mixed with 2 µl of Lipofectamine 3000 (Thermo Fisher Scientific, cat# L3000001), according to the manufacturer's protocol.Total cellular RNAs were extracted 48 h post-transfection and analyzed by RT-qPCR, as described above.

Fig. 1
Fig. 1 Systematic discovery of SRS-lncRNAs expressed during neuronal differentiation.A The pipeline used to identify regulated SRScontaining transcripts.Volcano plots for SRS-lncRNA expression changes in B the nuclear and C the cytoplasmic fractions at the NPC with the RBP motif data from the CISBP-RNA database (Ray et al. 2013):fimo --text --no-qvalue --norc --thresh 0.001 --motif-pseudo 0 Fig.2Genomic distribution of SRS-lncRNAs.Box plots showing the distribution of A all genome-encoded SRSs of >200 bp in length and containing ≥3 repeated units, B SRSs in all detectably expressed lncRNAs, and C SRSs in neurally upregulated lncRNAs along the human chromosome arms separated into 10 equally sized bins, form the middle of the centromere (position 0) to the end of the telomere (position 1).The p-arms of the acrocentric chromosomes 13, 14, 15, 21 and 22 encoding the 47S/45S rRNA arrays were excluded from these analyses.D The Kolmogorov-Smirnov (KS) test shows that the distributions of all SRS-lncRNAs (blue) and neural SRS-lncR-NAs (red) along the centromere-to-telomere axis do not significantly differ on a genome-wide scale.The data are presented as empirical cumulative distribution function (ECDF) plots.Karyoplots showing the color-coded density of E all detectably expressed SRS-lncRNAs and F neurally upregulated SRS-lncRNAs, calculated for individual human chromosomes as a number of loci per 5-Mb window.The centromere-to-telomere distributions of the neural and all detectably expressed SRS-lncRNAs were compared for individual chromosome arms using the KS test.A significant difference was detected only for chromosome 9 (chr9).The panels were generated using the Rideogram package (https:// cran.r-project.org/ web/ packa ges/ RIdeo gram/ vigne ttes/ RIdeo gram.html).G A close-up of the centromere-proximal part of chr9 encoding twelve neurally upregulated SRS-lncRNAs organized as six segmentally duplicated bidirectional transcription units (arrows).The panel also displays strand-specific countsper-million (CPM) normalized coverage plots for the nuclear NPC RNA-seq data.The alpha satellite-and GAAAT repeat-containing SRS-lncRNAs, which constitute the bidirectional unit, are highlighted in magenta and teal, respectively.The coverage plot on the top represents RNAs transcribed in the forward direction, while the bottom part shows RNAs transcribed in the reverse direction (colour figure online) ◂

Fig. 4
Fig.4Neuronal differentiation system for SRS-lncRNA validation.A The protocol for Dox-inducible neuronal differentiation of TRE-Ngn2 iPSCs established in this study (see "Methods" for further details).B CPM-normalized RNA-seq coverage plots for the nuclear fraction of ESC, NPC, and Neuronal 14 samples, illustrating one of the six segmentally duplicated chr9 SRS-lncRNA units.The unit encodes the alpha-satellite (XLOC_312995; magenta) and the GAAAT repeat (XLOC_320267; teal) containing SRS-lncRNAs transcribed in the opposite directions from a common GCrich (GGC) promoter region.The two SRS-lncRNAs and the corresponding SRSs are annotated at the bottom.Note that the microsatellite sequences are shown for the forward strand.The diagram also shows the positions of the four RT-qPCR amplicons (assays 1-4) analyzed in C-F.RT-qPCR assays validating the regulation of C and E XLOC_320267 and (D, F) XLOC_312995 expression during neuronal differentiation using primer pairs annealing either upstream (C, D) or downstream (E, F) for the corresponding SRSs.Data were averaged for differentiation experiments carried out using three distinct TRE-Ngn2 iPSC clones ±SEM and compared by one-way ANOVA with Tukey's post hoc test.*, Tukey's P < 0.05; ns, Tukey's P ≥ 0.05

Fig. 5
Fig. 5 Experimental validation of SRS-lncRNAs.We used RT-qPCR with SRS-proximal primers to validate the expression dynamics of A XLOC_319631, B XLOC_320382, and C XLOC_039609/KRTAP5-AS1, which are predicted to be upregulated during neuronal differentiation, as well as D XLOC_185088/CPMER, which is predicted to

Fig. 6
Fig.6SRS-lncRNA expression across differentiated cell types.Venn diagrams show overlaps between the neurally upregulated SRS-lncR-NAs identified by our pipeline using the data from(Blair et al. 2017)