Introduction

Eukaryotes broaden their protein-coding ability by utilising alternative splicing (AS) of pre-mRNAs1,2,3, and more than 90% of human transcripts undergo AS4. AS is controlled by a spliceosome complex that incorporates small nuclear (sn) RNAs, with over 100 associated proteins contributing to its function5. The U1 snRNP is the simplest of the five snRNP proteins and consists of U1 snRNA, seven Smith antigen (Sm) proteins, and three snRNP-specific proteins: U1-70 K, U1A, and U1C. The biogenesis of snRNPs occurs in the cytoplasm with Sm core proteins, and the 5′ and 3′ ends of the snRNAs are modified before snRNPs bind to the nucleus. U1C and U1-70 K interact with U1, U6, and SL RNAs6.

Of the three specific U1 proteins, U1A is mainly involved in binding to the 5′ splice sites (ss) in pre-mRNA. It recognises a conserved nucleotide sequence GU at the 5′ end of the intron. In addition, U1A helps to anchor the U1 snRNP to the correct position of the pre-mRNA and plays an important role in maintaining the stability of RNA-RNA interactions within the spliceosome7. U1-70 K is involved in the recognition of the 5′ ss. It specifically recognises the conserved splice donor site sequence at the exon–intron junction, which helps to stabilise the interaction between U1 snRNA and the 5′ ss and ensure the correct positioning of the spliceosome on the pre-mRNA. In addition, U1-70 K plays an important role in preventing premature splicing of transcripts, ensuring that splicing occurs only at appropriate splice sites8. U1C is primarily responsible for binding to U1 snRNA, which can interact directly with U1 snRNA to help stabilise the RNA–protein complex and plays a role in maintaining the integrity and structure of the U1 snRNP. This protein helps facilitate base-pairing interactions between U1 snRNA and pre-mRNA at the 5′ ss9. U1 proteins typically contain RNA recognition motif (RRM) structural domains that allow them to bind to U1 snRNAs and other RNAs, and play an important role in spliceosome formation and function. Specifically, U1 proteins bind to U1 snRNA through their RRM structural domains to form the U1 small nuclear RNP complex, which in turn binds to the 5′ ss at the splice site, facilitating spliceosome formation and splicing8. In addition, the RRM structural domain of the U1 protein may interact with other splicing factors and transcriptional regulators, and participate in the fine regulation and modulation of splicing10. Overall, both RRM and U1 proteins play important roles in the regulation of RNA splicing, and their structures and functions are of significance for understanding RNA splicing and its regulation.

Currently, there are many studies related to U1C. Human Studies have shown that changes in U1C gene and protein expression affect splicing, which in turn affects the expression of disease-associated genes and this has implications for disease progression11,12. For example, knockdown of U1-70 K and U1C increases protein expression of the Alzheimer’s disease-related genes APP and GSK-3β, while decreasing Nicastrin expression11. Additionally, transcriptome analysis of u1c mutants in related studies in zebrafish suggests that U1C plays an important role in AS, and that such mutations do not abolish alternative splicing, but rather alter the AS of a number of genes (e.g., dync1li1, ilf3, otpa, btf3, zgc:162,329, zgc:123,105, zgc: 152,873, ldb1a, and bcl7a) and may depend on the abundance of U1C13. In addition, several studies have confirmed a correlation between U1-70 K and U1C. Intra-U1 snRNP cross-regulation links U1C and U1-70 K expression via selective splicing, and U1 snRNP assembly14. Isoform equilibrium of U1-70 K is maintained by U1C protein levels. Downregulation of U1-70 K transcripts (as splice variants produce nonproductive mRNA isoforms) and proteins results in impaired U1C entry into U1 snRNP15. This feedback induces the production of normal U1-70 K, which restores U1 snRNP assembly. In plants, Arabidopsis thaliana U1C is similar to U1A and is equally distributed in the nucleus and cytoplasm, whereas U1-70 K is present in nuclear splice patches16. Nucleosomes and Cajal bodies are involved in the maturation of U1 snRNPs and U1C/A is predominantly found in these two nuclear regions. Therefore, there is increasing interest in examining plant U1C proteins and how they mediate U1 snRNP-5′ss.

This study provides a phylogenetic description of plant U1Cs and a basis for future functional studies of the regulation of plant AS. We studied U1C gene families in different species of plant and analysed their phylogenesis, conserved gene structures, protein domains, and spatiotemporal gene expression profiles. Our results provide basic phylogenetic, genomic, and gene expression information about the U1C gene family, which lays the foundation for further studies on the function of the U1C gene family.

Results

Sequence identification and phylogenetic analysis of U1C in plants

A BLAST search using the Arabidopsis U1C protein sequence was performed to identify other plant U1C genes in Phytozome. After filtering for sequences without the U1C (C2H2-type zinc finger) motif or truncated sequences, the remaining 114 putative U1C sequences from 72 plant species were obtained, including eight algae, one basal angiosperm, three bryophytes, 14 monocots, and 46 eudicots. Three species contained four copies of the U1C gene, six species contained three copies of the U1C gene, 21 species contained two copies of the U1C gene, and 42 species obtained one U1C candidate gene (Table S1). We also constructed phylogenetic trees for the U1A and U1-70 K genes of these 72 species for comparison with U1C (Tables S2, S3).

No more than four copies of the U1C gene were identified in any of the species. Notably, Four copies of U1C were obtained in Kalanchoe laxiflora (Milky Widow’s thrill) and Gossypium hirsutum (Upland cotton), which coincide with recent polyploidy events17. Three copies of this gene have been observed in two Brassicaceae species: cabbage and turnip mustard. One or two copies of the gene have been observed in common plant species, including bryophytes (moss and liverwort), monocots (switchgrass, sorghum, and maize), and dicots (sunflower, cashew, carrot, apple, potato, tomato, cassava, medicago, chickpea, cowpea, flax, cotton, and olive trees). As expected, all algal species contained only one U1C gene. Other common plants with a single copy include monocots (rice, millet, and barley) and dicots (lettuce, Arabidopsis, Capsella, papaya, cucumber, castor bean, common bean, red clover, cacao, strawberry, orange, peach, Populus, grape). Unlike U1C, the U1A and U1-70 K genes have no more than three copies. There were differences in the copy numbers of U1A and U1-70 K genes in each species compared with U1C. For example, in algae, Botryococcus braunii has two copies of the U1A gene, Porphyra umbilicalis has two copies of the U1-70 K gene, rice has only one copy of the U1C gene and two copies of the U1A gene, and barley has only one copy of the U1C gene and two copies of the U1-70 K gene. This suggests that only U1C has uniquely different copies compared to other U1 proteins, including U1A or U1-70 K. Phylogenetic trees were constructed for U1C, U1A, U1-70 K. All sequences clustered into five major clades exhibiting general plant phylogenetic classifications: algae (yellow), bryophytes (green), monocots (pink), basal angiosperms (purple), and dicots (purple). The phylogenetic tree displays a clear topology and high bootstrap values for each clade (dark pink) (Figs. S1S3 and the leftmost panel in Fig. S4). In addition, the yellow branch (algae) was the base part of the phylogenetic tree and was far from the remaining three branches, indicating that it was distant from the U1 family of genes in other plants. One sequence each from Arabidopsis halleri and Arabidopsis thaliana formed a small clade at the top of the tree, along with other Brassicaceae species (Brassica and Capsella), consistent with their close phylogenetic relationships. In contrast, the U1 family of genes in the model monocotyledon (Oryza sativa, rice) clustered with other monocotyledons. All sequences with two or three sequences from the same species tended to cluster closely together in the same subclade.

Comparative analysis: gene structure and DNA motifs

Analysis of the evolutionary path of the plant U1C gene and determination of its conserved function require a comparison of its gene structure and conserved gene motifs. The gene structure model and the corresponding conserved motifs were connected to a phylogenetic tree (Fig. S4). To facilitate observation of gene structure, we constructed the longest coding sequence of the U1C gene. Surprisingly, most of the U1C sequences exhibited four-exon-three intron organization (Fig. S4, middle panel), indicating that their gene structure is highly conserved in plant genomes. This suggests that U1C genes are functionally conserved in most species. However, a few exceptions were observed. Two maize U1C sequences contained eight and five exons, and one sorghum U1C sequence contained six exons. However, Brassicaceae has the smallest U1C gene structure among plant species. The largest variation among the subclade sequences was observed among algae. The Porphyra umbilicalis U1C sequence has one exon and no introns. In contrast, the U1C sequences of Chlamydomonas reinhardtii and Volvox carteri contained seven exons, and the U1C sequence of Dunaliella salina contained eight exons.

Variation in gene structure was minimal among U1C genes. Therefore, we attempted to determine whether the motif compositions of the cDNA sequences reflected any differences. Maximisation for motif elicitation (MEME) analysis showed that approximately 100 U1C genes in plants have similar sequence characteristics, and eight or nine of the first ten motifs were located in similar sequence positions (Fig. S4, right panel), with little difference among basal angiosperms, monocots, and dicots. However, sequences with two or three motifs were exclusively observed in the clades of algae and Bryophyta, even in sequences with gene structures seemingly similar to those of other plant species. Thus, there is no necessary link between gene structure and the appearance of motifs.

Conservation of protein domains and genomic synteny analysis of plant U1Cs

A phylogenetic tree was constructed by analysing protein domains and conserved amino acid (aa) motifs (Fig. 1). All peptides contained a U1C C2H2-Zinc finger domain at the amino terminus (Fig. 1, middle panel). MEME was used to annotate the ten most prominent aa motifs, which are presented as coloured boxes (Fig. 1, right panel). Again, the motifs of algal and bryophyte species were the least abundant among all observed species. Monocots possess an intermediate number of motifs. This suggests a divergence of plant U1C proteins among clades. In addition, more motifs were identified during evolution. As expected, large variations in gene structure may lead to a reduction in the number of conserved motifs. This analysis suggest that algal protein sequences may be less conserved than those of other plant species.

Figure 1
figure 1

Protein structure organization and identification of conserved amino acid motifs among plant U1C genes. (A) Protein structure (middle panel), conserved amino acid motifs from MEME analysis, the top ten conserved motifs are indicated by different coloured boxes (right panel), and the vertical phylogenetic tree (left panel). In the middle panel represent U1 zinc finger. The small colored box represents different motifs in the right panel. (B) In the MEME analysis, the first ten conserved amino acid motifs are indicated by different coloured boxes. The logos and detailed information of ten amino acid motifs from MEME are shown below. The relative size of the symbols indicates their frequency in the sequence. The height of each symbol is proportional to the frequency of occurrence of the corresponding base at that position. Black vertical lines in the phylogenetic analysis represent the break at that particular branch.

Synteny analysis of the exons revealed that the U1C gene structures of 46 plant species were highly conserved (Fig. S5). The middle part of the gene structure of all members was highly similar. However, this match was diminished at both ends of the gene, suggesting that although the regulation of U1C may vary owing to differences in the end part, its regulation of gene expression is somewhat similar across gene families because of the low similarity between coding sequences.

Clustering associated with synteny followed the phylogenetic tree, with the exception of the bryophyte Physcomitrella patens, which clustered with the dicots at the top of the tree.

Analysis of promoter and tissue/stimulus-specific expression

The PlantCARE webtool was used to scan a 2-kb upstream sequence (5′-flanking) of U1C genes (Table S4)18, and the expression profiles of these genes were analysed using data retrieved from the eFP browser, Phytozome, RiceXPro, Road v2 database (Figs. 2, 3 and Figs. S6S9). An extremely low representation of cis-elements was detected in sequences with only five cis-elements (Fig. 4 and Table S4). As expected, the CAT-box (for meristem expression) was the most represented and was detected in front of the transcription start sites. Owing to their CAT-box abundance, U1Cs may play a crucial role in newly formed tissues and are expected to fulfil the growing demand for transcription and downstream biological processes. HD-Zip 1 was found in five sequences, RY elements in five sequences, and motif 1 in one sequence. The only noticeable presence of the GCN4_motif in Brassicaceae species was surprising, and may be related to its importance in seed development and germination19. In the dicot Arabidopsis, U1C is highly expressed in developing embryos, maturing siliques, stigma, ovaries, seeds, and shoot apex, implying a potential role for the CAT-box and GCN_4 motifs of this gene in driving shoot-tip expression during flowering (Figs. 4 and S6). Notably, its expression was downregulated in flower and leaf tissues at seedling and adult plant stages. However, in P. trichocarpa, seedlings had a high transcript abundance (Fig. S7). Figure 2 shows the expression of the two U1C genes in the tissues of the three species. Glyma.02G152300 was the most abundant in new tissues. Solyc01g067830.2 in tomato, is highly expressed in fruits, and PGSC0003DMP400038064 in potatoes, is highly expressed in tubers. Another U1C gene, Glyma.10G021900, was expressed at lower levels in soybean than the former, consistent with the results obtained from the Phytozome database (Fig. S8). The expression of another U1C gene, Solyc06g00860.2 was similarly reduced in tomatoes. PGSC0003DMP400039624 was expressed at high levels in potatoes (Fig. 2). In monocots, one maize U1C gene was exclusively upregulated in the tassel and ear (male and female florets), implying a role in fertilisation and floral development. In rice, the highest expression was observed in the newly developing shoot apical meristem tissues and young leaves. In Brachypodium, shoot tissues showed the highest expression levels in very young and older plants (Fig. 3). In addition, based on RiceXPro, we found that the rice U1C gene was highly expressed in the embryo and endosperm 42 days after flowering (Fig. S8). In summary, U1C genes are mainly expressed in organs that act as a source of energy and nutrients for future growth or in processes that prepare such organs.

Figure 2
figure 2

Tissue expression of representative U1Cs in selected dicots. Expression data downloaded from plant eFP browser and transformed by log conversion (base = 2) is presented as heatmap.

Figure 3
figure 3

Tissue expression of representative U1Cs in selected monocots. Expression data downloaded from plant eFP browser and transformed by log conversion (base = 2) is presented as heatmap.

Figure 4
figure 4

Summary of promoter motifs identified in plant U1C genes that putatively confer tissue specificity. Five motifs are represented by orbiculars of different colors and labelled along the 2-kb upstream promoter region (straight line) for each plant U1C gene. CAT-box: cis-acting regulatory element related to meristem expression; GCN4 motif: cis-regulatory element involved in endosperm expression; HD-Zip1: element involved in differentiation of palisade mesophyll cells; RY-element: cis-acting regulatory element involved in seed-specific regulation.

Plants are sessile organisms that rely on gene expression to tolerate and adapt to changing environments20. To identify the most significant internal and external stimuli responsible for regulating plant U1Cs, we explored the stimulus-specific motifs in these sequences (Fig. 5). Twenty motifs were identified in the promoter region of plant U1Cs. The most abundant stimulus-specific cis-elements in the U1C sequence were the TGA element motifs, which are auxin-responsive motifs, followed by drought-responsive (ABRE) and temperature response (LTR) elements (Fig. 5). Correspondingly, Arabidopsis U1C expression was induced by auxin treatment (Fig. S6A). These results provide insights into the shoot-specific response of Arabidopsis U1C to drought, oxidative stress, wounding, and osmotic stress. In contrast, heat stress and Phytophthora infection upregulated this gene in all species (Fig. S9A). The expression of Arabidopsis AT4G03120 obtained from the eFP browser under different abiotic stimuli (time series from 0 to 24 h)21 showed early induction under heat stress (at 1–3 h, which subsided at 4 h in shoots but remained induced in roots), but slight shoot-specific downregulation under ROS-generating stress conditions (e.g. osmosis, drought, oxidative, genotoxic and wounding) and root-specific downregulation under salt stress (Fig. S9A). In rice, we obtained genetic changes in the U1C gene in response to drought, cold, high temperature, and salt stress from the ROAD v2 database (Fig. S9B) and U1C gene expression fluctuated. Heat and salt stress showed a trend of upregulation followed by downregulation. Cold stress showed an early induction of expression. In addition, we verified the expression of U1C gene (LOC_Os02g16640) in rice under different abiotic stimuli (0–12 h time series) using qPCR. U1C gene was induced early in shoots and then gradually weakened under low temperature (1–3 h) and salt stress (1–6 h) but was induced again after 12 h in roots. There were similar trends (at 1–6 h, early induction) in shoots and roots under drought stress, and the Cd treatment showed steady induction (at 1–12 h) in shoots. The opposite trend was observed, except that there was no significant difference between 6 h and the control in the roots under the Cd treatment (Fig. S9C).

Figure 5
figure 5

Summary of promoter motifs identified in plant U1C genes that putatively confer response to internal and external stimuli. Twenty identified motifs are represented by symbols of different colors and are labelled along the 2-kb upstream promoter region (straight line) for each plant U1C gene. ABRE: ABA binding response element; ARE: Anaerobic response element; AT-rich sequence: cis-element for maximal elicitor-mediated activation; AuxRR-core: cis-acting regulatory element involved in auxin responsiveness; CGTCA-motif: cis-acting regulatory element involved in MeJA-responsiveness; GARE-motif: gibberellin-responsive element; GC-motif: enhancer-like element involved in anoxic specific inducibility; LTR: cis-acting element involved in low-temperature responsiveness; MBS: MYB binding site involved in drought-inducibility; P-box: gibberellin-responsive element; SARE: Synaptic Activity-Responsive Element; TATC-box: cis-acting element involved in gibberellin-responsiveness; TCA-element: cis-acting element involved in salicylic acid responsiveness; TC-rich repeats: cis-acting element involved in defense and stress responsiveness; TGACG-motif: cis-acting regulatory element involved in MeJA-responsiveness; TGA-element: auxin-responsive element; WUN-motif: wound-responsive element. Black vertical lines represent the break at that particular branch.

Biotic treatments of leaf tissues with pathogens, including Botrytis (no change), Phytophthora infestans (significant induction), Pseudomonas (no change) and bacterial elicitors such as Flg22 (no change) for AT4G01320 suggest a possible pathogen-specific role in biotic responses. Hormone time-series data did not suggest any significant upregulation by any hormone, at least at the seedling stage, perhaps because of finer regulation at the cellular than the organ level. The eFP data suggested strong induction during embryonic development in cotyledons and roots (strongest), especially in the torpedo meristem region. Another notable strong induction was observed in stigma and ovaries (Fig. S6B), similar to the tassel and ear induction in maize U1C. Therefore, the corresponding cis-elements predicted in their promoter regions can be used as hypothetical targets for transcriptional regulation, mainly in seeds, through osmotic or osmotic-like stress and heat stress (Fig. 4).

In conclusion, significant differences were observed in the motif composition of U1Cs in plants, indicating the complexity of transcriptional regulation. Additionally, each promoter can be regulated by a combination of internal and external cues.

AS profile analysis and splicing isoforms

A comparison of gene structures of a few representative U1Cs, U1As, U1-70Ks across algae, bryophytes, monocots, and dicots revealed that alternative transcripts did not occur frequently in the U1C, U1A, U1-70 K gene families (Figs. 6, S10, S11). However, some species, including Physcomitrella patens (4), Hordeum vulgare (9), and Glycine max (3) in U1C and Hordeum vulgare (11) in U1-70 K, contain multiple transcript genes. Even for these splice isoforms, gene structures did not differ significantly from representative transcript forms, and most sequence differences were observed in the UTR. Thus, U1 family genes may not undergo large splicing changes and must act uniformly on substrates with conserved zinc-finger motifs across isoforms. Splicing-induced functional differences in U1C, U1A, U1-70 K were assumed to result in minor changes in substrate gene expression.

Figure 6
figure 6

AS profile analysis of U1C. Summary of annotated alternatively spliced transcript isoforms for identified U1C genes. Gene models were screened from the Phytozome v13.0 website.

Interaction network and homology modelling

We also investigated the role of U1C in biological regulation. Using U1C protein of Arabidopsis, rice, humans, mouse and yeast (YLR298C) as parameters, the U1Cs protein interaction network was constructed using WebTool String (Fig. S12A and Table S5). Two plant species (Arabidopsis and rice), two animal species (Homo sapiens and mice), and a yeast (Saccharomyces cerevisiae) were included. Interestingly, humans and mice share nine of the ten proteins, whereas yeast U1C appears to bind to a different batch of spliceosomal proteins in addition to the familiar U1A. Furthermore, compared with humans, the two plant species shared only two of the eleven protein interactors, suggesting that these two species have different protein networks for their own U1Cs. However, further validation is required to elucidate specific molecular functions.

U1C is the central component of the U1 snRNP and interacts with several other snRNP proteins. Understanding the protein structure is crucial for subsequent biochemical and functional analyses. A 3-D model of plant U1C proteins was constructed based on the template and crystal structures of human U1 snRNPs using homology modelling (Figs. S12B and S13). Although the crystal structure of human U1C is incomplete, the RRM domain containing the zinc finger motif (residues 1–38 in Arabidopsis) has been resolved in this crystal structure. Therefore, this result is sufficient to explain the existence of RRM domains in plants. Furthermore, conservation of the RRM domain was analysed. We observed a ConSurf Grade of 26 (49.1%) residues over 7, a ConSurf Grade of 12 (22.6%) residues over 9, and an amino acid conservation level of more than 80% among 51 (96.23%) sites (Fig. S14). RNA-binding sites were conserved, except for Arg3 and Tyr12 (Fig. S12B). Thr11, Thr14, His15, Ser19, Gln23, and Tyr28 form hydrogen bonds with RNA and play crucial roles in RNA binding. Arg3 interacts with RNA in almost all species. However, it was replaced by Leu, Gly, Val, and Ser in Miscanthus sinensis, Sorghum bicolor, Hordeum vulgare, and Brassica oleracea capitata, thereby influencing electrostatic interactions with RNA (Fig. S14). Tyr12 is highly conserved in all spermatophytes, except Hordeum vulgare. In summary, the RNA-binding domain of plant U1C is highly conserved, except for rare genetic mutations in some species.

Discussion

Misregulation of splicing leads to aberrant splicing of gene products. Aberrant splicing may affect protein function, stability, and expression, thereby adversely affecting cellular functions. Splicing abnormalities due to misregulation of the spliceosome cause 15–35% of human disease22,23. Studies in plants have also found that dysregulation of WUSCHEL and CLAVATA3 at low temperatures may cause meristem defects in pcp-1 loss-of-function mutants24. However, the molecular mechanisms by which splice protein mutations affect splicing regulatory fidelity remain unclear. In addition, the discovery of the mechanisms that regulate the constituents of AS raises fundamental questions regarding the use of splicing codes in eukaryotes. There have been relatively few studies on U1C genes in plants. This study also provides a preliminary framework for future research.

U1Cs display well conserved gene structures and minimal variation among genes

In this study, 114 U1C genes were identified in 72 plant species, ranging from aquatic algae to terrestrial plants. In more than half (42/72) of these species, the plant U1C gene family tended to retain one copy after several rounds of whole genome duplication (WGD)25, similar to other U1 family genes, with 70 species in U1A containing 47 single copies, and 71 species in U1-70 K, of which 40 were single copies. The U1 gene family appeared to be more susceptible to WGDs or segmental duplication events, with 41% (30 of 72 plant species) of U1C gene families, 33% (23 of 70 plant species) of the U1A gene family, and 43% (31 of 71 plant species) of the U1-70 K gene family having multiple members (Figs. 1, S4 and Tables S1S3, S6). Taking the U1C gene family as an example, species such as Sorghum bicolor, Panicum hallii, and Miscanthus sinensis can undergo several rounds of whole-genome duplication to form multiple gene copies, with changes in motifs, and structural domains (Figs. 4, 5), which implies that multi-copy genes may also be an important strategy for eukaryotes, including plants, to adapt to the external environment. Recent studies have hypothesised that gene family expansion after WGD provides novel candidates for functional diversification26. However, the functions of U1C need to be elucidated through further mechanistic studies.

As an important U1-specific snRNP, U1C has been extensively studied in animals and yeast. Homology of the sequence of U1C suggests that its function is conserved. In most cases, the zinc finger motif is sufficient to recognise the splice sites and its N-terminal region of the zinc-finger is highly conserved among species examined27. Furthermore, U1 snRNP modulates the starting and a prerequisite step in the formation of the spliceosome complex assembly (before U5 and U6 interactions with the exonic and intronic positions around the 5′ ss). Thus, it is evolutionary advantageous to have a conserved function across species as the basic step of 5′ ss recognition, followed by pairing with snRNPs, which must be performed similarly in all organisms13. Third, U1C is an essential component of splicing-constitutive or alternative pathways, and its levels are tightly regulated by U1-70Ks, especially in the context of nonsense-mediated mRNA decay, to maintain the required supply for sense transcript production14.

Exon synteny analysis indicated that the gene structure of plant U1Cs is highly conserved (Fig. S3). Furthermore, splicing was not frequently observed among plant U1Cs and was only observed at the ends (Fig. 6). For example, plants tend to produce transcription isoforms with truncated 5′ or 3′ sequences. Nevertheless, U1Cs seem to generate similar gene structures among isoforms, which may indicate that U1C and its isoforms similarly regulate gene expression (Fig. S7).

Plant U1Cs accumulate in seeds and embryos and are differentially regulated by abiotic and biotic stresses

U1-snRNP is an early subcomplex of the spliceosome assembly and U1C acts as the central unit of the snRNP complex (Fig. S7A). However, there have been few reports on its biological functions in plants28. U1C in plants, particularly Arabidopsis, is highly expressed in reproductive organs, embryos, and siliques/seeds (Figs. S4B and S5), suggesting that it may play a role in reproduction and early seedling development. This was confirmed by the high expression of U1C in the embryo and endosperm of other plant species, including maize and rice (Figs. 3 and S8). The available microarray expression data show that the transcriptional abundance of U1C genes in plants is uniformly distributed across these tissue types. Thus, the function of plant U1Cs in these developmental processes should be a topic for future studies. In addition, U1C may be involved in plant responses to stress. Among them, Arabidopsis U1C is more likely to act in response to drought and related stress conditions29 and to be thermally induced, which correlates with the abundance of predicted drought- and temperature-responsive cis-elements in the 5′-flanking region of Arabidopsis U1C (Figs. S4, S6). A search of a rice database also revealed that the U1C gene is involved in response to multiple stresses, including drought, cold, and heat (Figs. S9B,C). Intriguingly, very few tissue-specific motifs were found in the promoters of U1Cs compared to those in SYF2 or U1-70Ks30,31, and the enriched motifs were seed-specific, suggesting a high confidence of U1C localisation to seeds. In addition, Arabidopsis U1-70 K is responsive to multiple plant auxins, which are involved in morphogenesis31. Auxin-responsive motifs were found to be enriched in the promoters of U1Cs, suggesting that auxin is involved in regulating U1C expression to control embryo development (Fig. 5). However, these findings were based purely on silicon prediction and online expression datasets. The potential effects of these modes and their corresponding trans-acting proteins need to be validated by additional experiments, such as real-time quantitative PCR, and evaluation of the role of mutated modes in transactivation, electromigration, and chromatin immunoprecipitation32. Therefore, further research is necessary to clarify the specific functions of U1Cs during embryonic development, seed germination, and early seedling development using transgenic plant materials, including loss-of-function mutants, overexpression, and complementation lines33. However, their functional significance in plants has not been fully explored.

Conclusions

In this study, 114 U1C genes were identified in 72 plant species, and phylogenetic and bioinformatic analyses of the U1C gene family were performed. Because of the important role of U1C in improving splicing efficiency and functional preservation in plants, we hypothesised that plant U1C has important functions in response to various plant-specific environmental stimuli. Understanding their roles in plants using modern omics methods could provide alternative strategies in agricultural biotechnology, including crop gene manipulation, based on U1-snRNA interference34.

Materials and methods

Sequence identification of U1C

The Arabidopsis U1C sequence was used as a template to perform a protein BLAST search with an e-value cut-off of 1e−1035 in Phytozome v13.0. Selected sequences containing PF06220 were further filtered using HMMER v3.2.136,37. As a result, 114 putative U1C sequences from 72 plant species were selected for subsequent analyses (Table S1).

Phylogenetic analysis

Protein sequences of the U1C gene of the aforementioned plants were derived from Phytozome v13.0 annotation for phylogenetic tree construction to infer clustering patterns and evolutionary relationships. For genes with splice variants, the gene with the longest CDS was selected. Protein sequence alignment was performed using Muscle v3.8 with default settings38. The phylogenetic relationships among plant U1C genes were inferred using the Bayesian method implemented in MrBayes v3.2.2 package39. FigTree v1.4.3. was used to visualise and modify the phylogenetic tree40.

Analysis of gene structure, protein domain, MEME motif, and synteny analysis

Gene structure and protein domain were obtained from the Phytozome v13.0 and Pfam databases. The top ten detected motifs were obtained using multiple expectation MEME applied to the MEME server v5.0.5, from the input cDNA and protein sequences of plant U1Cs41.

We used U1C genes from 45 plant genomes belonging to major plant lineages. In addition to the genomes available in the Phytozome database, one representative sequence was used for each species. Synteny analysis was performed as previously described42.

Promoter motif prediction

The 2-kb promoter sequences of U1C genes were obtained from Phytozome and used to predict cis-elements using PlantCARE18. Motifs commonly associated with tissue-specific expression and responses to stimuli were further studied. Analyses were performed using TBtools43.

U1C gene expression data and RT‒qPCR analysis

Expression data for U1C genes, including tissue specificity and stress responses, were obtained from the eFP browser, RiceXPro, Phytozome, and ROAD v2. For the plant U1C gene, expression data were converted logarithmically to base 2 and displayed as heat maps using the BAR HeatMapper Plus tool.

Nipponbare seedlings were grown under hydroponic conditions at 22/20 ℃ with a 16/8 h light/dark cycle and a light intensity of 120 μmol m−2 s−1 in an incubator. Seedlings with two true leaves and one bud were treated with four stresses: a cold treatment at 8 °C, a drought treatment imposed with 20% PEG6000, salinity stress imposed by application of 100 mmol L−1 sodium chloride, and a cadmium toxicity treatment imposed by application of 100 μmol L−1 cadmium sulfate. Total RNA was isolated using TRIzol reagent and converted to complementary DNA using PrimeScript™ RT Master Mix following the manufacturer’s protocol. RT-qPCR using U1C was performed using a 7500 Real-Time PCR Detection System in conjunction with a SYBR Fast qPCR Mix. OsACTIN was used as an internal control, and relative gene expression levels were calculated according to Cao et al.44.

Protein‒protein interaction network, homology modelling, and estimation of amino acid conservation

Arabidopsis, Oryza sativa Japonica, Homo sapiens, Mus musculus and Saccharomyces cerevisiae U1Cs were used as inputs for STRING network analysis to identify interaction partners. Proteins with high interaction rankings were identified using a protein–protein interaction network. Homology modelling was conducted using the Arabidopsis U1C protein sequence based on the crystal structure of human U1 snRNP using SWISS-MODEL (Fig. S8)45. After valid sequences were screened using the ConSurf web server, the amino acid conservation score was calculated using the maximum-likelihood method46. Plots were created using PyMOL after inputting the protein structure and MSA data47.