Maize transposable elements contribute to long non-coding RNAs that are regulatory hubs for abiotic stress response
Several studies have mined short-read RNA sequencing datasets to identify long non-coding RNAs (lncRNAs), and others have focused on the function of individual lncRNAs in abiotic stress response. However, our understanding of the complement, function and origin of lncRNAs, and especially transposon-derived lncRNAs (TE-lncRNAs), in response to abiotic stress is still in its infancy.
We utilized a dataset of 127 RNA sequencing samples that included total RNA datasets and PacBio fl-cDNA data to discover lncRNAs in maize. Overall, we identified 23,309 candidate lncRNAs from polyA+ and total RNA samples, with a strong discovery bias within total RNA. The majority (65%) of the 23,309 lncRNAs had sequence similarity to transposable elements (TEs). Most had similarity to long-terminal-repeat retrotransposons from the Copia and Gypsy superfamilies, reflecting a high proportion of these elements in the genome. However, DNA transposons were enriched for lncRNAs relative to their genomic representation by ~2-fold. By assessing the fraction of lncRNAs that respond to abiotic stresses such as heat, cold, salt and drought, we identified 1077 differentially expressed lncRNA transcripts, including 509 TE-lncRNAs. In general, the expression of these lncRNAs was significantly correlated with that of their nearest gene. By inferring co-expression networks across our large dataset, we found that 39 lncRNAs act as major hubs in co-expression networks that respond to abiotic stress, and 18 appear to be derived from TEs.
Our results show that lncRNAs are enriched in total RNA samples, that most (65%) are derived from TEs, that at least 1077 are differentially expressed during abiotic stress, and that 39 are hubs in co-expression networks, including a small number that are evolutionarily conserved. These results suggest that lncRNAs, including TE-lncRNAs, may play key regulatory roles in moderating abiotic responses.
Keywords: Long non-coding RNA, Transposable elements, Abiotic stress, Co-expression network
lncRNA: Long noncoding RNA
lincRNA: Long intergenic noncoding RNA
FPKM: Fragments per kilobase per million mapped reads
LTR: Long terminal repeat
DEG: Differentially expressed genes
The functional component of any genome extends beyond its protein-coding sequences. Much of the additional function is encoded by RNAs, which vary in size from small RNAs (sRNAs) of < 25 nucleotides (nt) in length, to tRNAs of 70 to ~90 nt in length, to an even larger class of long non-coding RNAs (lncRNAs). lncRNAs are typically defined as being longer than 200 nt and containing no more than one short (< 100 amino acids) open reading frame.
lncRNAs represent a substantial proportion of transcriptional products. In mice, for example, an early study cataloged ~34,000 lncRNAs, representing one-third of all polyadenylated cDNAs. More recent work has annotated ~14,000 lncRNAs in humans. Work in plants has lagged somewhat behind, but plant lncRNAs have been identified based on various kinds of high-throughput expression data. For example, microarrays have been used to detect 6480 lncRNAs from Arabidopsis thaliana; single-stranded RNA sequence data have led to the identification of 2224 lncRNA transcripts in rice (Oryza sativa); and total RNAseq data have been employed to detect 7245 lncRNAs in maize (Zea mays ssp. mays).
At least three general properties of lncRNAs have become apparent from studies of both plants and animals. The first is that many lncRNAs are polyadenylated and capped, suggesting that they are transcribed and processed similarly to mRNAs. However, lncRNAs can also be non-polyadenylated, and hence robust lncRNA discovery requires consideration of both polyadenylated and non-polyadenylated RNA samples. The second is that lncRNAs tend to be expressed at lower levels than coding genes, but with precise spatio-temporal patterns [3, 7, 8, 9, 10, 11, 12, 13]. A third general property is that some lncRNAs overlap with coding regions and sometimes contain parts of an exon; however, most originate from intergenic spaces (and these are sometimes called long intergenic RNAs or lincRNAs). Consistent with their origin from intergenic spaces, a large proportion of lncRNAs are either derived from transposable elements (TEs) or contain remnants of TEs. For example, Kapusta et al. determined that 75% of human lncRNAs contained regions that appear to be derived from TEs.
Just as the origin and structures of lncRNAs are diverse, they play similarly varied functional roles. One major role is to act as templates for sRNA production, which in turn often contribute toward the epigenetic silencing of TEs [14, 15]. Some lncRNAs perform other key functions, especially regulatory roles in cellular and developmental processes [3, 16]. In plants, for example, lncRNAs have been shown to affect functions as diverse as phosphate signaling, flowering time, and susceptibility to pathogens. Consistent with the hypothesis that lncRNAs play important regulatory roles, some lncRNAs are conserved among species and appear to be under purifying selection [3, 20, 21].
A growing body of evidence also points to a potential role for plant lncRNAs in responses to abiotic and biotic stresses. A few studies have identified Arabidopsis lncRNAs that respond to salt, drought, heat and cold stresses, as well as phosphate starvation [22, 23, 24]. The expression of 28% (1832 of 6480) of Arabidopsis lncRNAs was found to be significantly altered under biotic and/or abiotic stresses. These findings (i.e., that lncRNAs are associated with stress responses) are particularly important in the context of crop species, because abiotic stresses affect crop yield and quality [13, 25, 26, 27, 28, 29]. However, the identification of lncRNAs during crop stress response remains largely unexplored, with a few notable exceptions. For example, 637 nitrogen-responsive lncRNAs and 664 drought-responsive lncRNAs have been identified in maize seedlings [6, 30]. Similarly, 1010 and 1503 lncRNAs are known to be differentially expressed under abiotic stress in rice and in chickpea, respectively. An important but challenging issue is to discover lncRNAs that are associated with abiotic stress responses and then to determine which lncRNAs function as key regulators, which serve as sRNA templates and which represent transcriptional noise.
Here we identify lncRNAs that relate to abiotic stress responses in maize. Our work extends previous maize lncRNA studies in at least three ways [6, 8, 30]. First, our efforts to detect lncRNAs are based on more expansive data. To perform lncRNA discovery, we have amassed 127 RNAseq datasets that were generated by different methods, in different tissues and across developmental stages, with a large subset generated in abiotic stress experiments, including salt, drought, heat, cold, UV and ozone stresses. The data include 89 RNAseq samples based on Illumina sequencing, 36 RNAseq datasets based on PacBio IsoSeq experiments, and two Illumina RNAseq datasets that were based on total RNA to potentially detect non-polyadenylated lncRNAs. Second, we investigate the relationship between TEs and lncRNAs. More than 85% of the maize genome consists of DNA derived from TEs, and we therefore expect that many lncRNAs exhibit sequence similarity to TEs. Thus far, however, the connection between lncRNAs and specific TE superfamilies has not been investigated for maize. Third, we identify the subset of lncRNAs that are differentially expressed under abiotic stress to begin to narrow the set of candidates that function in stress response. To further narrow a candidate list of potentially functional lncRNAs, we also investigate co-expression of lncRNAs with neighboring genes and within expression networks [33, 34]. Bringing these diverse analyses together, we identify several lncRNAs that are hubs in co-expression networks that respond to abiotic stress and show that several of these hubs are lncRNAs derived from TEs.
Construction of transcripts and lncRNA discovery
Table 1 A summary of lncRNA discovery (row categories include: fall within annotated lncRNAs; overlap with annotated lncRNAs)
The 23,309 lncRNAs were widely distributed across the 10 maize chromosomes (Additional file 1: Fig. S2). We also examined their location relative to annotated coding sequences within the maize genome. As expected from our search strategy, most lncRNAs (87.9%, 20,499 of 23,309) were intergenic, based on the output (a "u" class code) from gffcompare. Only 185 lncRNAs were found to be intronic, with 29 and 156 of these being polyA- and polyA+, respectively (Table 1). The few remaining high-confidence lncRNAs corresponded to, or overlapped with, previously annotated lncRNAs in the B73 v4 reference (Table 1). Among the 20,499 lincRNAs, 44.7% (9153 of 20,499) were from total RNA datasets (i.e., potentially polyA-), representing a significant enrichment for lncRNAs within the total RNA samples (Pearson χ-squared; p < 0.001).
Most lncRNAs are derived from transposable elements
Table 2 The proportion of base pairs attributable to different TE superfamilies based on the total length of inferred TE-lncRNAs and the B73 reference genome (columns: number of TE-lncRNAs; % total length of TE-lncRNAs; % total length of TEs in the genome)
These observations raise an interesting question: Do LTR/Gypsy and LTR/Copia elements give rise to lncRNAs more often than expected, given their proportion of the genome? To address this question, we estimated the proportion of total annotated TE length that was attributable to LTR/Gypsy, LTR/Copia and other element superfamilies, based on RepeatMasker analyses. We then compared these percentages to the corresponding proportions of total length among inferred TE-lncRNAs (Table 2). We found, for example, that LTR/Gypsy elements produced TE-lncRNAs at roughly the expected proportion (61% vs. 59%), relative to their representation in the genome. However, LTR/Copia elements contributed TE-lncRNAs at a lower proportion than their share of annotated TE length (22% vs. 33%). Particularly notable is the fact that class II DNA elements produced TE-lncRNAs in our dataset at a ~2-fold higher rate (12% vs. 6%) than expected based on their total length among TEs in the genome (Table 2). Altogether, our results verify that most maize lncRNAs derive from TEs, but they also indicate that different TE superfamilies give rise to TE-lncRNAs at different rates.
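The comparison above amounts to an observed/expected ratio per superfamily. The following minimal sketch uses only the three pairs of percentages quoted in the text; the dictionary keys and the function name are illustrative, and the full Table 2 contains further superfamilies that are not reproduced here.

```python
# Percentages quoted in the text (Table 2): share of total TE-lncRNA length
# versus share of total annotated TE length in the genome.
observed_pct = {"LTR/Gypsy": 61.0, "LTR/Copia": 22.0, "Class II DNA": 12.0}
genome_pct = {"LTR/Gypsy": 59.0, "LTR/Copia": 33.0, "Class II DNA": 6.0}


def fold_enrichment(observed, expected):
    """Return observed/expected for every superfamily present in both inputs."""
    return {name: observed[name] / expected[name]
            for name in observed if name in expected}


ratios = fold_enrichment(observed_pct, genome_pct)
for name, ratio in ratios.items():
    # A ratio above 1 means over-representation among TE-lncRNAs.
    print(f"{name}: {ratio:.2f}x")
```

A ratio near 1 (LTR/Gypsy) indicates representation in proportion to genomic length, a ratio below 1 (LTR/Copia) indicates under-representation, and a ratio of 2 (class II DNA elements) corresponds to the ~2-fold enrichment noted above.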
Differential expression under abiotic stress
Table 3 Numbers of differentially expressed genes, lncRNAs and TE-lncRNAs in maize seedlings under abiotic stress
Among the 1077 non-redundant lncRNA transcripts, many were differentially expressed in two or more treatments. For example, 679 lncRNAs were identified as differentially expressed across V3-V6 stages under heat treatment (Table 3; Fig. 2c; Additional file 4: Table S3, Additional file 1: Fig. S3). Of these, 29 lncRNAs were differentially expressed in all three developmental stages, but 79, 214 and 232 lncRNAs were specific to the V3, V4 and V6 stages, respectively. Interestingly, 40.51% (32/79) of heat-responsive lncRNAs at the V3 stage, 26.17% (56/214) at V4 and 42.67% (99/232) at V6 were also differentially expressed in response to other stress treatments, but were not shared among developmental stages. These patterns implicate many lncRNAs as a common component of abiotic stress responses, but they also imply that these responses have temporal (i.e., developmental) specificity in leaves from V3 to V6 seedlings.
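The distinction between lncRNAs shared across all stages and those specific to one stage can be expressed with set operations. This is a sketch only: the transcript IDs are hypothetical placeholders, and the function names are not taken from the original analysis.

```python
# Hypothetical transcript IDs; only the set logic is the point.
de_by_stage = {
    "V3": {"lnc1", "lnc2", "lnc3", "lnc7"},
    "V4": {"lnc1", "lnc4", "lnc7"},
    "V6": {"lnc1", "lnc5", "lnc6"},
}


def shared_in_all(sets_by_stage):
    """IDs differentially expressed in every stage."""
    return set.intersection(*sets_by_stage.values())


def stage_specific(sets_by_stage, stage):
    """IDs differentially expressed in `stage` and in no other stage."""
    others = set().union(
        *(ids for name, ids in sets_by_stage.items() if name != stage)
    )
    return sets_by_stage[stage] - others


print("shared:", sorted(shared_in_all(de_by_stage)))
for stage in de_by_stage:
    print(stage, "specific:", sorted(stage_specific(de_by_stage, stage)))
```

Transcripts found in exactly two stages (here `lnc7`) fall into neither category, which is why the stage-specific and fully shared counts need not sum to the total.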
Interestingly, 529 non-redundant TE-lncRNAs were differentially expressed under one or more conditions. TE-lncRNAs were underrepresented among differentially expressed lncRNAs: they made up 65% of all lncRNAs, but constituted only 45% and 56% of up- and down-regulated lncRNAs, respectively. Most of the differentially expressed TE-lncRNAs had similarity to LTR/Gypsy and LTR/Copia, as expected, but other TE families also contributed to differentially expressed TE-lncRNAs. For example, MSTRG.32907 exhibited similarity to LINE elements, MSTRG.73329 was similar to DNA/hAT-Ac elements, and MSTRG.37644 was similar to LTR/Gypsy elements. All of these were differentially expressed in leaves from V3 seedlings, but in different abiotic treatments (heat, cold and salt, respectively) (Fig. 4).
lncRNAs have been shown to be involved in cis regulation of neighboring genes. To investigate this possibility, we examined the correlation in expression between lncRNAs and their closest neighboring gene in either the 5′ or 3′ direction, yielding a dataset of 1077 differentially expressed lncRNAs and their neighboring genes. The expression of the lncRNAs was strongly (r = 0.48) and highly significantly (p < 2e-16) correlated with that of their closest neighboring gene (Fig. 2b), suggesting that lncRNAs may either be involved in cis regulation or are subject to some of the same cis regulatory features as their neighboring genes.
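For readers who wish to reproduce this kind of comparison, a Pearson correlation between paired expression values can be computed as sketched below. The expression vectors are made-up illustrations, and how lncRNA-gene pairs and samples were pooled in the original analysis is not specified here, so this shows only the statistic itself.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression values for one lncRNA and its nearest gene
# across five samples.
lncrna_expr = [0.5, 1.2, 3.4, 2.2, 0.1]
gene_expr = [10.0, 14.0, 30.0, 25.0, 8.0]
print(f"r = {pearson_r(lncrna_expr, gene_expr):.2f}")
```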
Co-expression modules associated with stress responses
Compared to coding genes and microRNAs, the function of lncRNAs in abiotic stress response remains largely unknown. Computational construction of gene co-expression networks can be a valuable tool for linking lncRNAs and coding RNAs and also for beginning to infer potential biological functions, because co-expressed genes are often members of the same pathway or protein complex, are often functionally related, or are controlled by the same transcriptional regulatory program [33, 39, 40, 41].
Recent work uncovered a temporal transcriptional logic underlying nitrogen (N) signaling in Arabidopsis; we see similar logic based on developmental timing for abiotic stress responses. Consider the example of heat stress: the ME_tan module was correlated with V3 heat stress (r2 = 0.89, p < 4e-32), the ME_yellow module correlated with V4 heat stress (r2 = 0.96, p < 1e-49), and the ME_darkturquoise (r2 = 0.43, p < 2e-05) and ME_pink (r2 = 0.49, p < 1e-06) modules were associated with heat stress in the V6 stage. These data suggest a developmental cascade of heat-responsive modules. To illustrate this graphically, we arranged the 16 associated modules by stress and development stage. Like heat stress, cold and drought stress were both associated with distinct modules at different developmental stages. There were exceptions, however, as both salt and UV stress associated with two modules in the V4 stage (Fig. 3).
LncRNAs are hubs in modules
An interesting facet of the 16 stress-associated modules is that each contained both lncRNAs and TE-lncRNAs. We have mentioned that the ME_yellow module contained the most lncRNAs of the 16 modules, with 147 lncRNAs and 65 TE-lncRNAs, but other modules were similar in containing lncRNAs. For example, the ME_tan module, which is associated with heat stress in V3, contained 26 lncRNAs and 9 TE-lncRNAs. An important question concerns the role of these lncRNAs in expression networks. One role, which is suggested by our results (Fig. 2b), is that some of the lncRNAs in modules are co-expressed with genes due to cis interactions. It is also possible, however, that lncRNAs regulate genes in trans. To investigate this possibility, we screened for key ‘hubs’, which we defined by high connectivity (i.e., intramodular connectivity within the top 10% of all members of the module), module membership > 0.9 and high significance (p < 0.01) in the module. Based on these filters, we identified 670 hubs that included 39 lncRNAs from different stress-responsive modules (Additional file 5: Table S4), of which 18 were TE-lncRNAs.
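The three hub criteria can be applied as a simple filter over the members of one module. The sketch below is illustrative only: the `Member` record, its field names and the toy values are assumptions, and the treatment of ties at the top-10% boundary is a choice made here rather than one stated in the text.

```python
import math
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    connectivity: float  # intramodular connectivity
    membership: float    # module membership
    p_value: float       # significance of the membership


def find_hubs(members, top_fraction=0.10, min_membership=0.9, max_p=0.01):
    """Return members in the top `top_fraction` by connectivity that also
    pass the membership and p-value thresholds."""
    if not members:
        return []
    ranked = sorted(members, key=lambda m: m.connectivity, reverse=True)
    n_top = max(1, math.ceil(len(ranked) * top_fraction))
    return [m for m in ranked[:n_top]
            if m.membership > min_membership and m.p_value < max_p]


# Toy module of 20 members (hypothetical names and values).
module = [
    Member("gene_A", 42.0, 0.85, 1e-6),  # highly connected, low membership
    Member("lnc_B", 40.5, 0.93, 1e-4),   # passes all three filters
] + [Member(f"filler_{i}", 1.0 + 0.1 * i, 0.5, 0.2) for i in range(18)]

hubs = find_hubs(module)
print([m.name for m in hubs])
```

In this toy module the two most connected members form the top 10%, but only one of them also satisfies the membership threshold, so high connectivity alone is not sufficient for hub status.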
Considering the heat-responsive modules as an example, the 3 associated modules had 27 lncRNAs as hubs, out of 225 total lncRNAs, with 12 of the 27 categorized as TE-lncRNAs. The 27 hub lncRNAs included TE-lncRNAs such as MSTRG.32907 (TE-lncRNA, LINE/L1, p < 1.78E-04), MSTRG.35709 (TE-lncRNA, LTR/Gypsy, p < 2.59E-114), MSTRG.44074 (TE-lncRNA, DNA/hAT-Ac, p < 2.11E-19) and MSTRG.37268 (TE-lncRNA, DNA/CMC-EnSpm, p < 1.63E-08). In Fig. 4, we illustrate the expression patterns of three of the top-ranked hubs within the heat-stress associated modules, with the top-ranked hubs for the other five abiotic stresses in Additional file 1: Figs. S5–9. All of these hubs are expressed under stress and demonstrate high intramodular connectivity.
These results suggest the possibility that lncRNAs, and more specifically some TE-lncRNAs, act to regulate abiotic stress responses. If they play a functional role, one would expect them to be conserved over evolutionary time. We tested this idea by blasting each of the 39 hub lncRNAs against an evolutionary gradient of genomes that included sorghum, rice and Arabidopsis (Additional file 9: Table S8). Of the 39, 16 had strong hits (e < 1e-15) to sorghum, a close relative of maize, and 4 of these 16 were TE-lncRNAs. Moreover, three of the hub lncRNAs had hits to rice, but zero TE-lncRNAs had rice hits, and none of the 39 hub lncRNAs had significant hits to Arabidopsis. Overall, these results suggest that ~10% of these lncRNAs have been conserved since the divergence of rice and maize, roughly 50 million years ago, and that 39% have been conserved since the divergence between sorghum and maize, roughly 16 million years ago.
Testing the reliability of RNA-seq based inferences via qRT-PCR
In this study, we accumulated and mined an expansive dataset to identify lncRNAs in maize, particularly those that are expressed in response to abiotic stress. Bioinformatic analyses led to the identification of 23,309 lncRNAs, the largest collection yet identified from maize. We characterized these lncRNAs with respect to three features: i) their prevalence and origins, especially lncRNAs that appear to be derived from TEs, ii) their expression levels and patterns, including a detectable cis-effect, and iii) their potential for functioning in abiotic stress response, as inferred from the construction of co-expression networks.
lncRNA identification and characterization
By its very nature, lncRNA discovery is limited by a number of factors. It is first, of course, limited by the definition of lncRNAs that has been used in the literature, i.e., an RNA molecule > 200 bp with at most one ORF or overlapping exon of < 100 codons. Following precedent, we have adopted this definition for lncRNA discovery, but it bears remembering that some of these lncRNAs could in fact be translated because they contain short ORFs. A second limitation is the fact that our search strategy did not include lncRNAs that overlapped with (or contained) an annotated exon. We applied this limitation purposefully, to avoid mis-classification based on fragmented RNA molecules or contigs. For that reason, however, our work likely underrepresents lncRNAs derived from genes and so some of our estimates may be inaccurate. For example, if many lncRNAs are derived from genic regions, then our estimate of the proportion of lncRNAs that are derived from TEs is an overestimate. It is worth noting, however, that our estimate of the proportion of TE-lncRNAs (65%) is similar to that of a previous, smaller maize study that estimated 68% of lncRNAs were derived from TEs. A third limitation is that the completeness of lncRNA discovery relies critically on the number of tissue and developmental samples that are available. With the exception of A. thaliana, for which lncRNA discovery was based on 2000 microarray transcriptomes, most plant studies have been limited to only a handful of samples, suggesting that there is still much to learn about the lncRNA complement within and among plant species. Although our study focuses on only one tissue (i.e., leaves from seedlings of different developmental stages), it greatly expands lncRNA discovery in maize because previously the largest number of RNAseq samples used for lncRNA discovery was 30.
Our RNA datasets were highly enriched for polyadenylated (polyA+) transcripts, because they consisted of 36 PacBio fl-cDNA datasets, 89 RNAseq datasets and only two total RNA datasets. Nonetheless, fully 44% of intergenic lncRNAs were identified from the total RNA data, representing a disproportionately large number relative to polyA+ data. This observation superficially suggests that far more lncRNAs are polyA-, which is an important point given that most, but not all [6, 49, 50], lncRNA surveys in plants have relied solely on RNAseq samples and not total RNA samples. Previous work has also suggested that the ratio of polyA- and polyA+ lncRNAs may be a function of growth conditions and external stresses. A fuller understanding of lncRNAs may require more substantial investments in total RNA datasets.
Most lncRNAs are TE-lncRNAs
Given our identification of 23,309 lncRNAs, we next sought to characterize their loci of origin and particularly to identify those that likely originated from TEs. We found that ~65% (15,312) of lncRNAs contained similarity to known TEs. Of these, most (61%, 9341 of 15,312) were similar to TEs over > 90% of their length, suggesting they were derived solely from TEs. As we noted above, our estimates of the proportion of TE-lncRNAs could be too high, based on our search strategy. However, it is also not surprising that we identified a high proportion of TE-lncRNAs, for at least three reasons. First, previous studies in mammals have demonstrated that most lincRNAs derive from TEs [7, 35]. Second, the maize genome is replete with TEs, with > 85% of the genome estimated to consist of DNA derived from TEs. Finally, an important function of lncRNAs is to be precursors for small RNAs, which in turn contribute to TE silencing via sequence homology [8, 51, 52, 53].
We also investigated the TE families from which TE-lncRNAs originated. Most of the TE-lncRNAs were derived from LTR/Gypsy and LTR/Copia families (Table 2), reflecting their preponderance in the maize genome [32, 53]. lncRNAs derived from LTR/Gypsy elements were represented in a similar proportion to their genomic proportion (by length) among the TEs we investigated in our study (Table 2). However, LTR/Copia elements were underrepresented in the TE-lncRNA dataset relative to their combined lengths in the genome, 22% versus 33%. This suggests that LTR/Copia elements do not produce lncRNAs as readily as LTR/Gypsy elements, at least within our data. The reasons for the difference between LTR/Copia and LTR/Gypsy are presently unclear, but one can consider two broad categories: TE age and TE location. For the former, older elements might be expected to be in a deeply silenced epigenetic state that relies primarily on the maintenance of methylation during cell division rather than an active epigenetic response that enlists lncRNAs. For the latter, one might expect LTR/Copia elements to be in genomic locations that are less transcriptionally active. In fact, however, the opposite is true, because LTR/Gypsy elements tend to be concentrated in pericentromeric regions where there may be less active transcription and less ongoing silencing. In contrast, LTR/Copia elements tend to accumulate preferentially in euchromatic regions that tend to be more transcriptionally active. Class II DNA elements also tend to be located near genes and euchromatic regions, but unlike LTR/Copia elements they produce lncRNAs at about a 2-fold higher rate than implied by their genomic lengths (Table 2). To sum up: we have shown that TE superfamilies over- and under-produce lncRNAs relative to their genomic representation based on our extensive collection of datasets, but the ultimate causes of these differences remain unclear.
Levels and patterns of lncRNA expression
Several previous papers from both plants and animals have shown that lncRNAs tend to be expressed at lower levels than bona fide genes and that they also tend to show tissue-specific patterns of expression [3, 7, 8, 9, 10, 11, 12]. We have verified the former by recording the maximum FPKM for each lncRNA transcript across datasets; on average, lncRNAs are expressed at 4-fold lower levels than genic transcripts by this metric (Fig. 2a). Unfortunately, we cannot verify that lncRNAs have more tissue-specific expression than genes, because the bulk of our data were isolated from leaves. We can, however, verify that they have lower entropy than genes, on average (average Shannon entropy = 2.10 for coding genes vs. 1.13 for lncRNAs), because the lncRNAs consistently lack expression evidence under more conditions.
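To make the entropy comparison concrete, the sketch below computes the Shannon entropy of one transcript's expression profile across conditions. The logarithm base (2) and the normalization of expression values to proportions are assumptions made here for illustration; the exact formulation used for the values reported above is not stated in this section.

```python
import math


def shannon_entropy(expression):
    """Shannon entropy (bits) of an expression profile across conditions.

    Expression values are converted to proportions of the profile total;
    conditions with zero expression contribute nothing to the sum.
    """
    if any(v < 0 for v in expression):
        raise ValueError("expression values must be non-negative")
    total = sum(expression)
    if total == 0:
        return 0.0
    entropy = 0.0
    for value in expression:
        if value > 0:
            p = value / total
            entropy -= p * math.log2(p)
    return entropy


# Hypothetical profiles: one broadly expressed, one restricted to a
# single condition.
broad = [1.0, 1.0, 1.0, 1.0]
restricted = [5.0, 0.0, 0.0, 0.0]
print(shannon_entropy(broad), shannon_entropy(restricted))
```

A transcript expressed evenly across all conditions reaches the maximum entropy for that number of conditions, whereas one detected in a single condition has zero entropy, which is the sense in which lower entropy reflects more restricted expression.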
Of the 13,822 polyA+ lncRNAs, we found that 1077 (7.79%) were differentially expressed under stress conditions, including 529 TE-lncRNAs. These TE-lncRNAs provided an opportunity to assess whether they could be linked to the expression of nearby genes, indicating some sort of cis-regulatory pattern, as has been observed in other species [20, 55, 56]. TE-lncRNAs were significantly correlated (r2 = 0.48; p < 2.0e-16) with their nearest neighboring genes (Fig. 2b), suggesting that TE-lncRNAs may either be involved in cis regulation or are subject to some of the same cis regulatory features as their neighboring genes, such as open chromatin structure.
lncRNAs, abiotic stress and coexpression modules
This study was designed specifically to identify stress-responsive lncRNAs. We approached this problem in two ways. We first identified differentially regulated lncRNAs from a series of controlled experiments for heat, cold, drought and salt stress. These experiments were based on leaf tissue from seedlings of the V3, V4 and V6 stages. Comparing the stress treatment to their corresponding control at the appropriate developmental stage across 12 different contrasts, we identified 1077 lncRNAs with evidence for differential expression. This observation corroborates previous studies in suggesting that lncRNAs may be differentially regulated under stress [6, 22, 23, 24, 30, 31], but it provides no indication whether the differentially regulated lncRNAs are a byproduct of stress responses or play a functional role. There is, however, a large gap between observing differential expression and proving function. As a first step toward bridging this gap, we have built co-expression networks based on both coding RNAs and lncRNAs from 89 RNAseq datasets, yielding a total of 40 co-expression modules. Of these, 16 were significantly associated with stress responses, and GO annotations of these modules were generally consistent with their inferred response functions. One interesting facet of these 16 modules is that they demonstrate clear patterns across developmental time (Fig. 3), suggesting that temporal hierarchies are important for plant responses to environmental stress.
It is difficult to infer function from co-expression modules, but studies have shown that genes with high connectedness tend to be functionally essential [58, 59]. We were therefore particularly interested in whether any of our lncRNAs are included within co-expression networks and particularly whether they are ‘hubs’ within network modules. Of the 16 modules that were significantly associated with stress responses, we identified 670 hubs, many of which corresponded to genes from known transcription factor families (Fig. 5). Of these 670 hubs, 39 were lncRNA transcripts. These represent our best candidates for lncRNAs that function in stress response, potentially as trans-acting regulatory factors. Consistent with this last conjecture, several of these lncRNA hubs were connected to genes from known TF families [60, 61, 62]. Moreover, ~10% of these lncRNAs yielded strong blast hits to rice, suggesting some measure of evolutionary conservation consistent with functional constraint, at least for this subset.
One somewhat surprising finding is that 18 of the 39 lncRNA hubs are related in sequence to, and perhaps derived from, TEs. This observation raises the intriguing idea that TE exaptation can occur at the level of lncRNAs. It is now well known that TE exaptation contributes to many aspects of genome function, including protein coding genes and especially functional regulatory elements [63, 64, 65]. The location of TE-lncRNAs as hubs, along with their connectedness to known TFs, suggests that a small subset of TE-derived lncRNAs may function as trans-acting regulatory factors in maize. If true, these hubs appear to have been recruited recently, given that only four of 16 yield strong hits to the sorghum genome. Clearly additional work is required to prove that these TE-lncRNAs function as hypothesized in abiotic response, but their centrality in co-expression modules is nonetheless an intriguing result that is consistent with previous findings showing that most lncRNAs are derived from TEs and that lncRNAs can play central regulatory roles in plant and animal development.
In this study, we gathered 36 PacBio IsoSeq datasets that were sampled from different tissues and 91 Illumina RNAseq datasets that were sampled from leaves of maize B73 [6, 67, 68, 69] (Additional file 2: Table S1, Additional file 1: Fig. S1). Of the Illumina datasets, 89 represented polyA+ transcripts and two were based on total RNA, which includes putative polyA- transcripts. The datasets were used for three purposes: lncRNA discovery, differential gene expression analyses, and the inference of gene co-expression networks. All of the 127 datasets were used for lncRNA discovery. A subset of 71 of the 91 RNAseq datasets were employed for differential gene expression analyses (Additional file 2: Table S1); these included replicated control and treatment samples from experiments that tested the effects of drought, salt, heat, cold, UV and ozone treatments on gene expression. Finally, all of the 89 polyA+ Illumina RNAseq datasets were used for inferring gene co-expression networks. The 89 Illumina datasets represented a developmental series sampled from leaves of V3, V4 and V6 seedlings; we take advantage of this developmental series in some network analyses (Additional file 2: Table S1, Additional file 1: Fig. S1).
Data processing and alignment
Raw data were converted into FASTQ-formatted files by the fastq-dump program from the SRA Toolkit (https://github.com/ncbi/sratoolkit). For Illumina data, the SolexaQA++ v3.1 program was employed for quality trimming, using the Q20 value. After trimming, any reads < 50 bp were removed. Cleaned reads were then aligned to the B73 reference genome sequence (v4, http://plants.ensembl.org) using the STAR aligner program with default parameters. Aligned reads were assembled into transcripts by the StringTie program, using the RABT (reference annotation-based transcript) assembly algorithm. For the PacBio IsoSeq data, reads were aligned to the B73 reference genome using the Minimap2 program. Unique isoforms were collapsed, based on genome alignment, by Cupcake ToFU (https://github.com/Magdoll/cDNA_Cupcake). Subsequently, the assembled transcripts from Illumina RNAseq and PacBio IsoSeq were merged using StringTie, which yielded a non-redundant unified set of transcripts.
Computational identification of intergenic and intronic lncRNAs
To find lncRNAs, a strict computational strategy was performed as described by Lv et al. (2016) and consisted of four steps. First, non-redundant transcripts were submitted to annotation programs to evaluate their coding potential. We used two annotation programs, CPC2 and CNCI, and retained as candidate lncRNAs those transcripts that were identified as having no coding potential by both programs. Second, we submitted candidates to the Pfam database using the Pfam_scan script (ftp://ftp.ebi.ac.uk/pub/databases/Pfam/), which aligns transcripts with HMMER. We filtered out any transcripts that aligned to known protein families at an E-value < 1e-05. Third, we compared the remaining transcripts to reference annotations using gffcompare, which outputs various codes to designate the relationship of transcripts to annotated coding regions. We retained transcripts with class codes “i”, which indicates that a transcript is fully contained within a reference intron, and “u”, which designates transcripts that are not obviously related to known coding regions, for further analyses. This last step is likely to miss some sense and anti-sense lncRNAs that derive from coding regions, but it also limits false positives based on incompletely assembled coding transcripts. Finally, we retained transcripts as high-confidence lncRNAs if they passed all of the previous steps, if they were longer than 200 bp, and if they had an FPKM (fragments per kilobase of exon model per million reads mapped) > 1 in at least one of our sample datasets. To determine the relationship of high-confidence lncRNAs to TEs, we masked the lncRNA sequences to identify TE domains. Masking was based on the maize-specific library of the Repbase database (www.girinst.org) and was performed by RepeatMasker (www.repeatmasker.org).
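The filtering cascade described above can be summarized as a single predicate applied to each assembled transcript. The sketch below assumes that the outputs of CPC2, CNCI, Pfam_scan and gffcompare have already been parsed into one record per transcript; the `Transcript` class and its field names are hypothetical and do not correspond to the output formats of those tools.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcript:
    name: str
    length_nt: int
    cpc2_coding: bool                  # True if CPC2 called it coding
    cnci_coding: bool                  # True if CNCI called it coding
    best_pfam_evalue: Optional[float]  # None if there was no Pfam hit
    class_code: str                    # gffcompare class code
    max_fpkm: float                    # maximum FPKM across samples


def is_high_confidence_lncrna(t, pfam_cutoff=1e-5, min_length=200,
                              min_fpkm=1.0):
    """Apply the four filtering steps in order."""
    # Step 1: no coding potential according to both programs.
    if t.cpc2_coding or t.cnci_coding:
        return False
    # Step 2: no alignment to a known protein family below the cutoff.
    if t.best_pfam_evalue is not None and t.best_pfam_evalue < pfam_cutoff:
        return False
    # Step 3: intronic ("i") or intergenic/unknown ("u") class codes only.
    if t.class_code not in {"i", "u"}:
        return False
    # Step 4: length and expression thresholds.
    return t.length_nt > min_length and t.max_fpkm > min_fpkm


# Hypothetical records for illustration.
candidates = [
    Transcript("tx1", 850, False, False, None, "u", 3.2),
    Transcript("tx2", 900, False, True, None, "u", 5.0),   # CNCI: coding
    Transcript("tx3", 700, False, False, 1e-8, "i", 2.0),  # Pfam hit
    Transcript("tx4", 180, False, False, None, "u", 4.0),  # too short
]
print([t.name for t in candidates if is_high_confidence_lncrna(t)])
```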
Gene expression analyses
We performed two separate types of analyses based on gene and lncRNA expression data. The first was differential expression analysis based on comparisons between stress and control data (Table S1). To perform these analyses, high-quality reads were aligned to the B73 reference using the STAR program. For reads that mapped to multiple locations, we removed alignments with a mapping quality < 20 using SAMtools. Raw counts were quantified using the featureCounts program, and the FPKM value per gene was calculated using a custom Perl script. The DESeq2 package was used to perform pairwise comparisons between samples to identify differentially expressed transcripts. To identify differentially expressed genes (DEGs), we relied on two criteria: the log2(fold change) had to be > 1 and the adjusted p-value from DESeq2 analyses had to be < 0.05.
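The FPKM calculation and the two DEG criteria can be sketched in Python as follows. This is not the custom Perl script mentioned above. The function names are hypothetical, and applying the fold-change threshold to the absolute value (so that both up- and down-regulated transcripts are retained) is an assumption rather than something the text states.

```python
import math
from typing import Optional


def fpkm(fragment_count: float, feature_length_bp: int,
         total_mapped_fragments: int) -> float:
    """Fragments per kilobase of exon model per million mapped fragments."""
    if feature_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and library size must be positive")
    # count / (length / 1e3) / (library size / 1e6)
    return fragment_count * 1e9 / (feature_length_bp * total_mapped_fragments)


def is_differentially_expressed(log2_fold_change: float,
                                padj: Optional[float],
                                lfc_cutoff: float = 1.0,
                                padj_cutoff: float = 0.05) -> bool:
    """Apply the two DEG criteria to one row of a DESeq2 results table."""
    # DESeq2 can report NA for padj; such rows are treated as not significant.
    if padj is None or math.isnan(padj):
        return False
    # ASSUMPTION: the threshold is applied to |log2FC|.
    return abs(log2_fold_change) > lfc_cutoff and padj < padj_cutoff
```

For example, 100 fragments on a 2,000 bp transcript in a library of 10 million mapped fragments give an FPKM of 5.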
The second type of analysis was the inference of co-expression networks. To construct networks, expression profiles were extracted for each gene and lncRNA, and expression levels were normalized using the variance stabilizing transformation in DESeq2. Co-expression correlations among lncRNAs and genes were based on Pearson correlations with R² ≥ 0.8 across the 89 RNAseq datasets. An unsigned co-expression network was inferred using the WGCNA package with an optimal soft threshold of 12. Modules within the network were assigned using the Topological Overlap Matrix (TOM). The correlations between modules and stress treatments were calculated and plotted, and the significant stress-responsive modules were then extracted for further analysis. Co-expression networks were visualized with the Gephi program.
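To make the correlation and soft-thresholding steps concrete, the sketch below computes a Pearson correlation between two expression profiles, applies the 0.8 cut-off, and converts the correlation into an unsigned WGCNA-style adjacency, a = |r|^β with β = 12. It is a pure-Python illustration, not a call into the WGCNA R package. Reading "R²" as the squared Pearson coefficient is an assumption, and the function names are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two expression profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return sxy / math.sqrt(sxx * syy)


def passes_r2_filter(r: float, cutoff: float = 0.8) -> bool:
    """ASSUMPTION: the cut-off is applied to the squared correlation."""
    return r * r >= cutoff


def unsigned_adjacency(r: float, beta: int = 12) -> float:
    """Unsigned soft-threshold adjacency: |r| raised to the power beta."""
    return abs(r) ** beta
```

Because the network is unsigned, strongly negative and strongly positive correlations yield the same adjacency, while raising to the 12th power pushes moderate correlations toward zero (for instance, |r| = 0.5 gives roughly 0.00024).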
Gene ontology enrichment analysis
The eigengene probes of each stress-responsive module were assigned putative functions by searching against the UniProt protein database with the Blastx program, using a cut-off e-value ≤ 1e-10. Coding eigengenes were then submitted to the AgriGO v2 online toolkit for gene ontology term enrichment. A Fisher’s exact test was applied for the enrichment analysis and the p-value was adjusted using the Bonferroni method, with an experiment-wide significance level of 0.05.
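The enrichment statistics can be illustrated with a short Python sketch of a one-sided Fisher's exact test (the hypergeometric upper tail) followed by Bonferroni adjustment. It does not reproduce AgriGO's internals; the one-sided, over-representation form of the test and the function names are assumptions.

```python
from math import comb
from typing import List


def fisher_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """One-sided Fisher's exact p-value for over-representation.

    k: module genes annotated with the GO term
    n: genes in the module
    K: background genes annotated with the GO term
    N: genes in the background
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    if not (0 <= k <= min(n, K) and n <= N and K <= N):
        raise ValueError("inconsistent contingency counts")
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def bonferroni(p_values: List[float]) -> List[float]:
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def significant_terms(p_values: List[float], alpha: float = 0.05) -> List[bool]:
    """Flag terms whose Bonferroni-adjusted p-value is below alpha."""
    return [p < alpha for p in bonferroni(p_values)]
```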
Experimental stress treatment, RNA extraction and qRT-PCR analysis
The maize inbred line B73 was germinated in a greenhouse at JAAS (Jiangsu Academy of Agricultural Sciences). Seedlings at the three-leaf (V3) stage were then incubated at 50 °C for 4 h for heat stress treatment, as described by Makarevitch et al. Control plants were kept at 25 °C. Leaves from three independent biological replicates were collected and processed for RNA extraction and first-strand cDNA synthesis using PrimeScript™ RT Master Mix (TaKaRa). qRT-PCR was performed using SYBR Premix DimerEraser™ kits (TaKaRa) on a Real Time PCR System (Roche LightCycler® 96, USA), according to the manufacturer’s instructions. Quantification results of target transcripts were calculated using the comparative 2^(-ΔΔCT) method. Primers were designed using Primer Premier 5 and can be found in Additional file 10: Table S9.
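The comparative 2^(-ΔΔCT) calculation can be sketched in Python as follows. The text does not name the reference gene, so the reference CT values here are a hypothetical input, and averaging CT values across the three biological replicates before taking differences is an assumption.

```python
from statistics import mean
from typing import Sequence


def relative_expression(ct_target_treated: Sequence[float],
                        ct_reference_treated: Sequence[float],
                        ct_target_control: Sequence[float],
                        ct_reference_control: Sequence[float]) -> float:
    """Fold change of a target transcript by the comparative 2^(-ddCT) method.

    Each argument holds the CT values of the biological replicates.
    ASSUMPTION: replicate CTs are averaged before differences are taken.
    """
    # dCT: target normalized to the reference gene within each condition.
    delta_ct_treated = mean(ct_target_treated) - mean(ct_reference_treated)
    delta_ct_control = mean(ct_target_control) - mean(ct_reference_control)
    # ddCT: treated condition relative to control.
    delta_delta_ct = delta_ct_treated - delta_ct_control
    return 2.0 ** (-delta_delta_ct)
```

A value above 1 indicates higher expression under heat stress than in the control, and a value below 1 indicates lower expression.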
The authors would like to acknowledge the support of High-Performance Computing Cluster at UC Irvine.
BSG and YDL conceived and designed the study. YDL and FQH performed the lncRNA discovery; YFZ and YDL performed the TE annotation. FQH and FLW performed the co-expression network analysis, heat treatment and qRT-PCR experiments. FQH, YDL and BSG wrote the manuscript. All authors read and approved the manuscript.
This work was supported by the National Natural Science Foundation of China (31771813), NSF grant DEB-1655808 to BSG, the JAAS Exploratory and Disruptive Innovation Program (ZX (17)2015), the National Natural Science Foundation of China (31601315) and the Natural Science Foundation of Jiangsu Province (BK20160582). The funding bodies had no role in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.
The authors declare that they have no competing interests.
- 10. Wang M, Yuan D, Tu L, Gao W, He Y, Hu H, et al. Long noncoding RNAs and their proposed functions in fibre development of cotton (Gossypium spp.). 2015;:1181–97.
- 33. van Dam S, Võsa U, van der Graaf A, Franke L, de Magalhães JP. Gene co-expression analysis for functional classification and gene–disease predictions. Brief Bioinform. 2017;19:bbw139.
- 41. Borrill P, Harrington SA, Simmonds J, Uauy C. Identification of transcription factors regulating senescence in wheat through gene regulatory network modelling. Plant Physiol. 2019;:pp.00380.2019.
- 67. Li P, Cao W, Fang H, Xu S, Yin S, Zhang Y, et al. Transcriptomic profiling of the maize (Zea mays L.) leaf response to abiotic stresses at the seedling stage. Front Plant Sci. 2017;8:290.
- 82. Bastian M, Heymann S, Jacomy M, et al. Gephi: an open source software for exploring and manipulating networks. ICWSM. 2009;8:361–2.
- 86. Lalitha S. Primer Premier 5. Biotech Software & Internet Report. 2000;1(6):270–272.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.