Introduction

Long-read RNA isoform sequencing (Iso-Seq)1,2,3, built on the single-molecule real-time (SMRT) sequencing technology4 of Pacific Biosciences (PacBio), allows complete, high-accuracy sequencing of cDNA fragments thousands of bases long5. Iso-Seq can identify RNA isoforms without read assembly and is therefore widely used to improve gene annotation in newly generated or highly fragmented plant genome sequences6,7,8,9,10,11,12. Its high base accuracy and long read length also make it highly sensitive for analyzing RNA alternative splicing, an important basis for plant adaptation to internal and external environmental changes. Improving the annotation of alternatively spliced transcripts is an important goal of gene annotation. Gene annotation using Iso-Seq data therefore not only improves the accuracy of gene expression quantification but also allows accurate analysis of RNA alternative splicing at the transcript level. Iso-Seq has been widely used to study RNA alternative splicing related to tissue specificity and stress response in plants13,14,15,16,17,18,19,20,21.

Despite its significant research and application potential, the widespread adoption of Iso-Seq is constrained by its considerable cost, which limits the number of samples that can be sequenced. The PacBio Sequel II sequencer offers high-fidelity (HiFi) sequencing of 15–20 kb library molecules in circular consensus sequencing (CCS) mode22. This supported insert length of 15 to 20 kb far exceeds the average length of conventional plant transcripts. For instance, the typical length of lettuce reference gene transcripts is approximately 1.64 kb (Table 1). Running Iso-Seq on a single lettuce sample may therefore substantially underutilize the read length of the PacBio Sequel II sequencer. An alternative approach is to concatenate short amplified fragments into longer molecules for sequencing, a viable way to increase sequencing throughput and reduce analysis costs on the PacBio Sequel II platform. Notably, we recently developed HIT-scISOseq23, a technique tailored for single-cell long-read transcriptome sequencing on the PacBio Sequel II platform. This method is based on the concatenation of barcoded cDNA generated by the 10× Genomics system.

Table 1 Performance of HIT-ISOseq in lettuce samples

Lettuce is among the most widely grown and consumed dicotyledonous vegetables worldwide. It is characterized by easy cultivation and high nutritional value, with an annual production value of $2.4 billion in the United States24. Lettuce is a diploid plant of the Asteraceae family, with 18 chromosomes and an estimated genome size of 2.5 Gb25,26. However, the gene annotation of its reference genome does not incorporate long-read RNA sequencing data25,26,27. The limitations of the reference gene annotation hinder research on and application of lettuce genomic resources, calling for a feasible solution.

In this study, we adapted the HIT-scISOseq approach to bulk samples, here referred to as HIT-ISOseq, to perform long-read RNA isoform sequencing by concatenating cDNA into long molecules for sequencing. This method was applied to pooled samples from distinct bulk lettuce tissues, namely root, stem, and leaf, each with two replicates. HIT-ISOseq generated 15.79 million long-read cDNA sequences from a single Sequel II SMRT Cell 8 M chip, evenly distributed across the six pooled samples. Each sample yielded a balanced output, ranging from 2.55 to 2.74 million full-length non-chimeric (FLNC) reads. The transcriptomic data obtained through HIT-ISOseq allowed robust clustering of the six lettuce samples by tissue type. Gene and transcript expression derived from HIT-ISOseq data correlated highly with the results obtained through Next-Generation Sequencing (NGS), validating the gene expression quantification. Additionally, the HIT-ISOseq data added 51,558 transcript annotations, 119 GO terms, and 25 KEGG pathways. The HIT-ISOseq data enabled the identification of 2,611 genes and 4,842 transcripts with tissue-specific expression in lettuce. Furthermore, we characterized 18 isoforms whose expression patterns differed from those of their corresponding genes and performed qPCR validation for these isoforms. These findings provide a comprehensive long-read transcriptomic atlas of multiple lettuce tissues, offering a valuable resource for advancing lettuce genome research.

Results

Experimental design

HIT-ISOseq differs from HIT-scISOseq in that its workflow starts from total RNA extracted from tissue samples instead of cDNA generated by the 10× Genomics system. For HIT-ISOseq, we devised a primer pair with an upstream sequence containing a dU base, which is used for cDNA amplification after RNA extraction from bulk samples and reverse transcription of polyA+ RNA into double-stranded cDNA. The cDNA ends were then digested with the USER enzyme to generate sticky ends featuring a 6 bp palindromic sequence. The digested cDNA molecules were subsequently ligated using T4 DNA ligase. After a second round of PCR amplification, the ligated products underwent SMRTbell library preparation with the official PacBio barcoded adapter kit and were sequenced on a Sequel II instrument (Fig. 1A). Following sequencing, the Circular Consensus Sequencing (CCS) reads of each sample were obtained by demultiplexing the pooled, barcoded CCS reads using PacBio’s Lima software. These CCS reads were further processed to isolate the final full-length non-chimeric (FLNC) reads using the deconcatenate program developed by the authors (refer to the Methods section). For this investigation, three tissues (root, stem, and leaf) from the Italian lettuce cultivar were chosen as the pooled samples for HIT-ISOseq sequencing, with two biological replicates for each tissue.

Fig. 1: Overview of the workflow and the performance of HIT-ISOseq.

A The schematic diagram of the experimental steps of HIT-ISOseq including 6 main steps: 1) Collecting two biological replicates of lettuce roots, stems, and leaves, and extracting total RNA from each sample; 2) Reverse transcription of the RNA from each sample to generate first-strand cDNA, followed by amplification of the first-strand cDNA using UMI-labeled PCR primers containing a 5’ dU base; 3) Digestion of the amplified double-stranded cDNA products with USER enzyme to generate the sticky ends; 4) Independent ligation of multiple cDNA fragments from each sample using T4 DNA ligase; 5) Construction of SMRTbell sequencing libraries for each sample by adding different barcoded adapters to the ligation products, followed by pooling of the six sample libraries; 6) Sequencing of the pooled library using PacBio CCS on the Sequel II platform. B Density plot showing the relationship between the CCS read length and the number of FLNC cDNA fragments obtained for each sample. C Bar plot depicting the relationship between the numbers of CCS reads and the number of FLNC fragments obtained for each sample. D Box plot illustrating the distribution of genomic alignment coverage and identity for the FLNC fragments from each sample. The center line: median; boxes: first and third quartiles; whiskers: 5th and 95th percentiles. E Density plot comparing the length distribution of unique isoforms from each sample with the length distribution of reference annotated transcripts. F Bar plot showing the classification of the number of exons, coding/non-coding status, and junction types for the merged set of unique isoforms (n = 69973) derived from the HIT-ISOseq data. G Tissue-type clustering of gene expression levels for the HIT-ISOseq samples.

Performance of sequencing runs

We evaluated the performance of the HIT-ISOseq dataset in terms of yield, quality, and sample uniformity using three indicators: CCS Reads, FLNC Reads, and Gene and Isoform Feature Detection (Table 1).

Regarding CCS reads, HIT-ISOseq generated 4.56 M reads from a single SMRT Cell 8 M chip, with 0.71 to 0.84 M CCS reads per sample. The average number of CCS passes and the average quality value (QV) per sample ranged from 17 to 20 and from 0.97 to 0.98, respectively. The average length of CCS reads per sample (4.04 to 4.86 kb) was 2.45–2.95 times that (1.65 kb) of the existing lettuce reference transcripts (Table 1, Fig. 1B).

Regarding FLNC reads, a total of 15.79 M reads were produced, with an average of 3.46 FLNCs per CCS read. Each sample yielded 2.55 to 2.74 M FLNC reads, with the average read length ranging from 736.31 to 779.97 bp (Table 1, Fig. 1C). The length distributions of FLNC reads were highly similar across samples (Supplementary Fig. 1A). It is worth noting that the average length of FLNC reads is much lower than that of lettuce reference transcripts, indicating that HIT-ISOseq may tend to concatenate shorter fragments. Additionally, we found that longer CCS reads did not yield more FLNC molecules. In fact, the number of FLNC molecules first increased and then decreased with increasing CCS read length, showing an inverted “U” relationship. The highest number of FLNC molecules was observed at ~10 kb of CCS length (Fig. 1B). Based on the genomic mapping metrics of FLNC reads, an average of 95.3% of FLNC reads could be mapped to the genome, with mapping coverage exceeding 98% and mapping identity above 97.8% (Fig. 1D).
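
The yield figures quoted above can be cross-checked with simple arithmetic. The following minimal Python sketch uses only the values reported in the text; the variable names are illustrative.

```python
# Values quoted in the text (Table 1); variable names are illustrative.
total_ccs_reads_m = 4.56          # million CCS reads from one SMRT Cell 8 M
total_flnc_reads_m = 15.79        # million FLNC reads after deconcatenation
ref_transcript_kb = 1.65          # average length of reference transcripts
ccs_len_range_kb = (4.04, 4.86)   # per-sample average CCS read length

# Average number of FLNC fragments recovered per CCS read.
flnc_per_ccs = total_flnc_reads_m / total_ccs_reads_m

# CCS read length expressed as a multiple of the reference transcript length.
length_ratio = tuple(round(x / ref_transcript_kb, 2) for x in ccs_len_range_kb)

print(f"FLNC per CCS read: {flnc_per_ccs:.2f}")
print(f"CCS length / reference transcript length: {length_ratio}")
```

Both derived values agree with the reported 3.46 FLNCs per CCS read and the 2.45–2.95-fold length ratio.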

In terms of Feature Detection, HIT-ISOseq identified a total of 31,297 genes and 69,973 transcripts, with the numbers of genes and transcripts ranging from 23,550 to 27,629 and from 54,711 to 58,998 per sample, respectively. The detected isoform lengths using HIT-ISOseq generally covered the length range of lettuce isoforms, although the average length (1.56–1.59 kb) of the detected isoforms was shorter than that (1.65 kb) of the reference lettuce annotated transcripts (Fig. 1E). The majority of the detected isoforms (90.87%) exhibited a multi-exonic structure, with most (89.27%) being coding transcripts. Furthermore, the splice sites of these isoforms were predominantly canonical (99.97%, Fig. 1F, Supplementary Fig. 1B–D). Additionally, our method successfully clustered the lettuce samples based on tissue type at the level of gene expression (Fig. 1G).

HIT-ISOseq improves the annotation of reference transcripts and functional annotations for lettuce

We performed quality control and filtering of the non-redundant isoforms generated by HIT-ISOseq using SQANTI3. This process retained a total of 197,363 isoforms (Supplementary Fig. 1B–D). To ensure reliable downstream differential expression analysis, we further filtered out isoforms with a total UMI count lower than 5 across all six samples, resulting in 69,973 isoforms (Figs. 1F, 2A). These isoforms were classified as FSM (full splice match: isoforms that match reference annotations), ISM (incomplete splice match: isoforms whose internal junction sites agree with reference annotations but whose 5’ and/or 3’ exons are truncated relative to reference annotations), NIC (novel in catalog: isoforms that have not been annotated but use a combination of known splice sites and exons), and NNC (novel not in catalog: isoforms that contain at least one splice site not annotated).
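
The UMI-count filter described above can be sketched as follows. This is a minimal illustration on toy data, not the pipeline used in the study; the function name, isoform IDs, and counts are hypothetical.

```python
def filter_isoforms_by_umi(umi_counts, min_total=5):
    """Keep isoforms whose UMI count summed over all samples is >= min_total.

    umi_counts: dict mapping isoform ID -> list of per-sample UMI counts.
    """
    return {
        iso: counts
        for iso, counts in umi_counts.items()
        if sum(counts) >= min_total
    }

# Toy data for six samples (three tissues x two replicates); IDs are made up.
toy_counts = {
    "iso_A": [3, 2, 0, 0, 1, 0],   # total 6 -> kept
    "iso_B": [1, 0, 1, 0, 1, 1],   # total 4 -> removed
    "iso_C": [0, 0, 0, 0, 5, 0],   # total 5 -> kept
}
kept = filter_isoforms_by_umi(toy_counts)
print(sorted(kept))
```

The threshold is applied to the sum across samples, so an isoform supported by a single tissue can still pass if it has enough UMIs there.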

Fig. 2: Updating of lettuce reference annotation by HIT-ISOseq.

A Bar plot showing the SQANTI classification of isoforms (n = 69973) identified by HIT-ISOseq in the six lettuce samples. B Bar plot comparing the changes in the number of transcripts before and after annotation update by HIT-ISOseq. C Venn diagram depicting the annotation status of transcripts in the NR, Pfam, SwissProt, KOG, and KEGG databases after annotation update by HIT-ISOseq. D Pie chart illustrating the database annotation status of transcripts after annotation update by HIT-ISOseq, including the overall annotation status and representative annotations from the NR, GO, and KEGG databases. E Changes in the number of GO terms included before and after annotation update by HIT-ISOseq. F Changes in the number of KEGG pathways included before and after annotation update by HIT-ISOseq.

To enhance transcript annotation accuracy, we integrated the long-read transcriptome data obtained from HIT-ISOseq with the reference transcripts of lettuce genes. Following SQANTI3 quality control and filtering, we obtained 197,363 HIT-ISOseq non-redundant isoforms. These isoforms were merged with the reference gene annotation of NCBI Lsat_Salinas_v7 using StringTie software, resulting in a new lettuce transcriptome in GFF format. We checked the consistency of merging HIT-ISOseq transcripts against different versions of the reference genome and gene annotation. The results showed that the lettuce gene annotation has not been significantly updated to date (Supplementary Fig. 2). The new lettuce transcriptome GFF was evaluated on three criteria: transcript number, transcript structure, and functional database annotation (Fig. 2B–F, Supplementary Fig. 3B–D). After merging, the new lettuce transcriptome GFF contained 117,608 transcripts (Supplementary Table S3), an addition of 51,558 transcripts compared to the lettuce reference transcripts (Fig. 2B). Notably, HIT-ISOseq data added 1,285 new genes, comprising 1,988 transcripts, to the gene annotation of Lsat_Salinas_v7. These new genes do not overlap with existing annotated regions and therefore correspond to intergenic regions.

Afterward, the newly obtained lettuce transcripts were subjected to annotation against the NR (NCBI non-redundant protein database), Pfam (Protein families database), Swiss-Prot, KOG (Eukaryotic Orthologous Groups of proteins), and KEGG (Kyoto Encyclopedia of Genes and Genomes) databases. Remarkably, a substantial portion (96.9%) of the transcripts were successfully annotated in at least one of these databases (Fig. 2C). Specifically, 93.2% of the transcripts were annotated in the NR database (Fig. 2D), with the majority (92.03%) of these assigned to lettuce (Supplementary Fig. 3A). Additionally, 71.0% and 70.7% of the transcripts in the new lettuce GFF could be annotated in the GO and KEGG databases, respectively (Fig. 2D). In comparison to the annotation results obtained using the lettuce reference transcripts, the adoption of the new lettuce GFF resulted in an augmentation of 119 and 25 terms in the GO and KEGG annotations, respectively (Fig. 2E, F).

HIT-ISOseq deciphers tissue-specific expressed genes in lettuce

To assess the performance of HIT-ISOseq in gene expression quantification, the identified gene counts from HIT-ISOseq data across six lettuce tissue samples were subjected to saturation curve analysis. The results revealed saturation in gene identification when the number of FLNC reads exceeded 1 million (Fig. 3A). Because the throughput of HIT-ISOseq (~15 million reads) far exceeds the ~1 million reads needed to saturate gene detection in a single plant tissue sample, the method has high potential to analyze many plant samples simultaneously on a single SMRT Cell 8 M chip. Correlation analysis of gene expression quantification between HIT-ISOseq and NGS RNA-seq (Supplementary Table S4) gave correlation coefficients > 0.923 for all samples (Fig. 3B), indicating the high accuracy of HIT-ISOseq in gene expression quantification. Subsequently, HIT-ISOseq data were used to identify tissue-specific expressed genes (TSPGs) in lettuce, revealing 976, 259, and 1,338 TSPGs in leaf, root, and stem samples, respectively. Furthermore, GO and KEGG functional enrichment analyses of the TSPGs, performed with the databases re-annotated in this study (Fig. 3D, E, Supplementary Fig. 4A, B), highlighted enrichment of photosynthesis, aquaporin TIP, and the auxin-activated signaling pathway in leaf, root, and stem samples, respectively.
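
A saturation curve of this kind can be computed by subsampling reads to increasing depths and counting the distinct genes detected at each depth. The sketch below is a generic illustration on synthetic data and is not the procedure used in the study; the function name, gene IDs, and depths are hypothetical.

```python
import random

def saturation_curve(read_gene_ids, depths, seed=0):
    """Count distinct genes detected when reads are subsampled to each depth.

    read_gene_ids: one gene ID per read (the gene that read was assigned to).
    Returns a list of (depth, number_of_genes_detected) tuples.
    """
    rng = random.Random(seed)
    shuffled = list(read_gene_ids)
    rng.shuffle(shuffled)                     # one random read order
    curve = []
    for depth in depths:
        depth = min(depth, len(shuffled))     # cannot exceed available reads
        curve.append((depth, len(set(shuffled[:depth]))))
    return curve

# Synthetic reads over 50 genes with skewed abundance (gene0 most abundant).
toy_reads = [f"gene{i}" for i in range(50) for _ in range(1 + 3 * (50 - i))]
toy_depths = [10, 100, 1000, len(toy_reads)]
toy_curve = saturation_curve(toy_reads, toy_depths)
for depth, n_genes in toy_curve:
    print(depth, n_genes)
```

Because each depth is a prefix of the same shuffled read order, the curve never decreases, and it flattens once additional reads stop revealing new genes.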

Fig. 3: Identification of tissue-specific expressed genes in lettuce using HIT-ISOseq.

A Line graph showing the variation in the number of genes detected in each sample under different levels of HIT-ISOseq FLNC number (x-axis). B Scatter plot depicting the correlation between gene expression TPM values from NGS sequencing and gene expression CPM values from HIT-ISOseq. C Volcano plot illustrating differentially expressed genes in the root, stem, and leaf tissues, with the highlighted boxes representing the top 3 upregulated (red) or downregulated (blue) genes in each tissue. D Heatmap displaying tissue-specific expressed genes in the root, stem, and leaf tissues. E Dot plot showing enrichment of GO biological processes for tissue-specific expressed genes in the root, stem, and leaf tissues. F Dot plot presenting enrichment of KEGG pathways for tissue-specific expressed genes in the root, stem, and leaf tissues.

HIT-ISOseq deciphers tissue-specific expressed isoforms in lettuce

Besides gene expression quantification, HIT-ISOseq also excels in quantifying RNA isoform expression. The lettuce samples could be effectively clustered based on the tissue type using RNA isoform expression quantification (Fig. 4A). Consequently, tissue-specific isoforms of lettuce in roots, stems, and leaves could be discerned by analyzing isoform expression data from each sample. In this study, we identified 1651, 342, and 2849 tissue-specific isoforms in root, stem, and leaf samples, respectively (Fig. 4B). Notably, upon integrating the expression of tissue-specific isoforms with their corresponding genes (Fig. 4C), we observed that the expression patterns of 18 isoforms diverged from those of their respective genes (Supplementary Table S5), as illustrated in Fig. 4D. Furthermore, the identified isoforms exhibited exon structures notably distinct from the reference annotated transcripts and principal isoforms of their associated genes, depicted in Fig. 5A, Supplementary Fig. 5A, Supplementary Fig. 6 and Supplementary Fig. 7. Enrichment analyses via GO and KEGG pathways demonstrated the involvement of these isoforms in pivotal biological processes, including cell wall organization, channel activity, and participation in intracellular metabolic processes (Supplementary Table S6). This outcome underscores HIT-ISOseq’s capacity to unveil intricate regulatory expression dynamics at the transcript level with remarkable resolution.

Fig. 4: Identification of tissue-specific expressed isoforms in lettuce using HIT-ISOseq.

A Tissue-type clustering of isoform expression levels determined by HIT-ISOseq for each sample. B Heatmap showing tissue-specific expressed isoforms in the root, stem, and leaf tissues. C Mfuzz trend analysis combining gene expression and isoform expression data from HIT-ISOseq. D Comparison of 18 isoforms with different expression trends identified by HIT-ISOseq compared to their corresponding gene expression trends.

Fig. 5: Validation of tissue-specific expressed isoforms.

A Comparative exon structure diagram showing the tissue-specific expressed isoform (green), its corresponding reference transcript of the gene (blue), and the major expressed isoform (red) in HIT-ISOseq. B Line graph illustrating the expression levels of the tissue-specific expressed isoform (green), its corresponding reference transcript of the gene (blue), and the major expressed isoform (red) across different samples in HIT-ISOseq. C qPCR quantification results of the tissue-specific expressed isoforms in different samples (n = 3) from HIT-ISOseq. Data are presented as mean values +/- SD. Statistical significance was calculated by two-sided Student’s t-test, and adjusted P-value was calculated using Bonferroni-Holm correction method.
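
For reference, the Bonferroni-Holm adjustment named in the legend above can be sketched in a few lines of Python. This is a generic implementation of the step-down procedure applied to made-up p-values, not the authors' analysis script.

```python
def holm_adjust(p_values):
    """Return Bonferroni-Holm adjusted p-values in the original input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is multiplied by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

raw_p = [0.01, 0.04, 0.03]   # hypothetical raw p-values
print(holm_adjust(raw_p))
```

The running maximum keeps the adjusted values ordered consistently with the raw p-values, which is what distinguishes Holm's procedure from a plain Bonferroni correction.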

Experimental validation of tissue-specific isoform expression

To validate the fidelity of HIT-ISOseq in detecting tissue-specific expression isoforms, we conducted qPCR validation on six isoforms. These isoforms were chosen based on their distinct tissue-specific expression patterns from the previously identified pool of 18 isoforms (refer to Supplementary Table S2). The expression profiles of these six isoforms were consistently observed in both HIT-ISOseq and qPCR results, as depicted in Fig. 5B, C, Supplementary Fig. 5B, C. This consistent observation confirms the precision and robustness of HIT-ISOseq. This outcome provides evidence that HIT-ISOseq not only accurately identifies tissue-specific expression isoforms but also reliably quantifies their expression levels. As a result, HIT-ISOseq is expected to have broad utility in functional investigations and biological interpretations of plant samples. It serves as a dependable tool for delving deeper into the regulatory mechanisms of tissue-specific gene expression and facilitating functional annotation.

Discussion

Lettuce stands as one of the most globally favored and extensively consumed vegetables, rendering the investigation into its genetic variations profoundly significant. Despite the sequencing of the reference genome of cultivated lettuce (Lactuca sativa L.), the current annotation of lettuce genes remains incomplete, impeding comprehensive research and the broad utilization of lettuce genomic resources. Recent advancements in long-read sequencing technology have positioned it as a prevailing method in plant genome assembly. Additionally, long-read DNA sequencing and long-read RNA isoform sequencing (Iso-Seq) offer substantial advantages in analyzing RNA alternative splicing and aiding gene annotation. However, their relatively lower throughput constrains their widespread application on a larger scale.

In this study, we employed a PacBio Circular Consensus Sequencing (CCS) method termed HIT-ISOseq, which involves concatenating multiple cDNA molecules to achieve high-throughput and high-precision RNA isoform sequencing across diverse samples. Using HIT-ISOseq, we conducted sequencing on the root, stem, and leaf tissues of lettuce (Lactuca sativa L.), providing valuable insights into tissue-specific gene expression and RNA isoforms in lettuce. By harnessing multi-tissue long-read RNA sequencing, we refined the transcript annotation of the lettuce reference genome, significantly expanding genomic annotations covering GO and KEGG pathways. Through the utilization of HIT-ISOseq data, we identified tissue-specific expressed genes and isoforms within lettuce’s root, stem, and leaf tissues. Integrating tissue-specific isoforms with their corresponding gene expression profiles uncovered distinctive expression patterns in tissue-specific isoforms.

Furthermore, the throughput or yield of HIT-ISOseq can be further enhanced by increasing the number of concatemers. Notably, although longer CCS reads do not always correlate with higher FLNC molecule counts, the highest number of FLNC molecules was observed at a CCS length of ~10 kb (Fig. 1B). HIT-ISOseq yield and average FLNC read length can be improved by selectively removing shorter ligation products (e.g., < 5 kb) through gel cutting or BluePippin size selection. We expect that HIT-ISOseq will enable population-scale long-read RNA sequencing of larger sample sets in the future, with the introduction of the latest PacBio Revio SMRT Cell 25 M chip and an increase in the number of concatemers. Additionally, while this study was an early proponent of a concatenation approach for sequencing multiple bulk samples, the recent launch of PacBio’s official Kinnex concatenation library preparation kit (based on the MAS-ISO-seq28 method) is gradually establishing concatenation-based methodologies as a mainstream option in Iso-Seq applications.

In conclusion, this study conducted a thorough investigation into lettuce genetics, highlighting the persisting gaps in gene annotation despite the sequencing of the cultivated lettuce genome. Leveraging HIT-ISOseq enabled a comprehensive exploration of RNA isoforms across diverse lettuce tissues. This exploration revealed tissue-specific gene expression patterns and significantly enhanced the transcript annotation of the lettuce genome, which in turn enriched the functional annotations in GO and KEGG. Furthermore, we gained insights into strategies to boost HIT-ISOseq throughput, suggesting its potential scalability to larger-scale RNA sequencing with newer platforms. While the principal aim of this study is not to provide a reference-level gene annotation for lettuce, our long-read RNA-seq data significantly enhance the annotation of lettuce transcripts. With the establishment and enhancement of lettuce germplasm gene resource databases, such as LettuceGDB29 and LettuceDB30, we envision that the HIT-ISOseq data generated in this study will facilitate exploration of lettuce genetic resources. Our study also offers an effective strategy to improve gene prediction and enhance genome annotation for other plants with incomplete gene annotation. In the future, long-read RNA sequencing of multi-tissue samples using HIT-ISOseq or PacBio Kinnex can be employed to increase the number of full-length transcripts extending beyond the currently annotated gene boundaries and to discover novel genes within intergenic regions.

Methods

Plant material and growth condition

Seeds of the Italian lettuce were evenly placed into a substrate for germination. Once they had four true leaves, healthy seedlings of equal growth were selected and transferred to 1/2 Hoagland’s nutrient solution for hydroponic cultivation (Supplementary Table S1). The hydroponic cultivation lasted for 4 weeks in a greenhouse at natural temperature (22 ± 3 °C), with the nutrient solution renewed every 7 days. During cultivation, the plants grew to maturity. Six mature lettuce plants were randomly selected and divided into two biological replicate groups, labeled s1 and s2, with three plants per group. After three washes with ultrapure water, each selected plant was cut into three parts: roots, stems, and leaves. The tissues were immediately placed into sterile, nuclease-free cryotubes, snap-frozen in liquid nitrogen, and then kept at −80 °C for subsequent NGS RNA-seq and HIT-ISOseq.

RNA isolation and cDNA preparation

Total RNA was extracted from each sample using the QIAGEN RNeasy Mini Kit (#74104). To prepare cDNA, 1 μg of RNA was utilized along with SuperScript™ II Reverse Transcriptase (Invitrogen) and an RT primer (5’-AAGCAGTGGTATCAACGCAGAGTACNNNNNNNNTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTVN-3’). The 12 µL final reaction mixture contained RT primer, dNTP mixture, and total RNA, initially incubated at 65 °C for 5 minutes followed by 42 °C for 2 minutes. Subsequently, 5× first-strand buffer, recombinant RNase inhibitor, 0.1 M DTT, SuperScript™ II RT, and nuclease-free water were added to achieve a final volume of 19 µL. Reverse transcription proceeded at 42 °C for 60 min, followed by addition of 1 µL of Template Switching Oligo (TSO, 5’-AAGCAGTGGTATCAACGCAGAGTACATrGrG+G-3’) and further incubation at 42 °C for 30 minutes. The resulting cDNAs were purified using 0.8 volumes of Agencourt AMPure XP Beads (Beckman Coulter) and eluted in 20 μL nuclease-free water.

NGS RNA-seq library preparation and sequencing

The sequencing library was prepared using the MGIEasy mRNA library prep kit. The libraries were sequenced on an MGISEQ-2000 sequencer (BGI Genomics, Wuhan, China) in paired-end 150 mode. Reads that passed the MGI quality filters were retained for further analysis. Reads containing adapter or low-quality sequences were filtered out. To eliminate contamination and obtain passed reads, a series of data processing steps was performed using SOAPnuke31. The parameters for the SOAPnuke software were configured as follows: “-n 0.01 -l 20 -q 0.4 --adaMR 0.25 --ada_trim --polyX 50”. The specific filtering conditions are as follows: 1) Trimming adapter: Reads matching the adapter sequence by 25% or more (allowing up to 2 mismatches) will undergo adapter sequence trimming; 2) Filtering low-quality data: Reads will be discarded if 40.0% or more of the bases have a quality value lower than 20; 3) Removal of Ns: Reads will be eliminated if the proportion of Ns exceeds 1.0% of the total read length; 4) Removal of polyX: Reads with a polyX tract (where X can be A, T, G, or C) longer than 50 bp will be discarded entirely.
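
The three discard rules listed above can be expressed compactly in code. The following Python sketch mirrors the thresholds stated in the text but is not SOAPnuke itself: adapter trimming is not modelled, and the function name and toy reads are hypothetical.

```python
import re

def passes_filters(seq, quals, low_q=20, max_low_frac=0.4,
                   max_n_frac=0.01, max_polyx=50):
    """Return True if a read survives the three discard rules in the text.

    seq: base string; quals: list of Phred quality scores, one per base.
    Adapter trimming is NOT modelled in this sketch.
    """
    if not seq or len(seq) != len(quals):
        return False
    seq = seq.upper()
    # Rule 2: discard if 40% or more of the bases have quality below 20.
    low_frac = sum(q < low_q for q in quals) / len(quals)
    if low_frac >= max_low_frac:
        return False
    # Rule 3: discard if the proportion of N exceeds 1% of the read length.
    if seq.count("N") / len(seq) > max_n_frac:
        return False
    # Rule 4: discard if any homopolymer (polyX) run is longer than 50 bp.
    longest_run = max(
        (len(m.group(0)) for m in re.finditer(r"A+|C+|G+|T+", seq)),
        default=0,
    )
    return longest_run <= max_polyx

good_read = "ACGT" * 25
print(passes_filters(good_read, [30] * len(good_read)))
```

Each rule is checked independently, so a read is rejected as soon as it violates any one of them.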

Amplification of cDNA for PacBio library construction

The purified cDNA products were subjected to 12 cycles of amplification using KAPA HiFi HotStart Uracil 2× ReadyMix (Kapa Biosystems) and PCR primers incorporating deoxyUracil: 5’-ACTAGUAAGCAGTGGTATCAACGCAGAG -3’. The resulting amplified cDNA products were then purified with 0.8 volumes of Agencourt AMPure XP Beads and eluted in 16 μL of nuclease-free water. Quantification was performed using Qubit dsDNA HS Assay Kits (Thermo Fisher), and the quality was evaluated using Agilent 2100 DNA HS Assays.

USER cloning-based ligation of multiple inserts

The amplified cDNA was treated with 1 μL of USER Enzyme (NEB) and 2 μL of 10× T4 DNA ligase buffer (NEB), then incubated at 37 °C for 20 minutes. During this time, the USER enzyme cleaved at the deoxyuracil sites to create 3’ palindrome overhangs. Next, 1 μL of T4 DNA ligase (NEB, 400,000 U mL⁻¹) was added, and the reaction mixture was incubated at 16 °C for 10 minutes to promote insert ligation. The resulting multi-insert library was purified using 0.4 volumes of Agencourt AMPure XP Beads (Beckman Coulter), followed by end-repair and A-tailing with the NEBNext Ultra II End Repair/dA-Tailing Module, involving incubation at 20 °C for 15 minutes and 65 °C for 30 minutes. The cDNA was then ligated with 2 μL of a dT-overhang selection adapter (10 μM, annealed with primers 5’-GAACGACATGGCTACGATCCGACTT-3’ and 5’ PHO-AGTCGGATCGTAGCCATGTCGTTC-3’) using the NEBNext® Ultra™ II Ligation Module (NEB) at 20 °C for 15 minutes, and purified again with 0.4 volumes of Agencourt AMPure XP Beads (Beckman Coulter). Subsequently, 100 ng of the purified products were PCR amplified for 8–10 cycles using KAPA HiFi HotStart 2× ReadyMix and a selection primer (5’PHO-GAACGACATGGCTACGATCCGACTT-3’) to ensure the multi-insert library had no ligation nicks. Finally, the amplified products were purified once more using 0.4 volumes of Agencourt AMPure XP Beads (Beckman Coulter) and assessed using Agilent DNA 12000 Assays.
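
To illustrate why the dU-containing primer yields self-complementary overhangs, the sketch below checks that the first six bases of the primer (with U read as T) form a palindrome and then splits a toy concatemer at that site. It assumes that each ligation junction consists of the 6 bp site flanked on both sides by primer sequence. This is an illustrative assumption and may differ from the authors' deconcatenate program, which would also need to tolerate sequencing errors; the toy inserts are invented.

```python
import re

# dU-containing PCR primer quoted in the Methods.
PRIMER = "ACTAGUAAGCAGTGGTATCAACGCAGAG"
SITE = PRIMER[:6].replace("U", "T")    # first 6 nt, with U read as T
PRIMER_BODY = PRIMER[6:]               # primer sequence downstream of the site

def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))

# A palindromic site equals its own reverse complement.
SITE_IS_PALINDROME = reverse_complement(SITE) == SITE

# Assumed junction: <reverse-complemented primer><site><primer>.
# Lookarounds keep the primer flanks attached to each fragment.
JUNCTION = re.compile(
    f"(?<={reverse_complement(PRIMER_BODY)}){SITE}(?={PRIMER_BODY})"
)

def split_concatemer(read):
    """Split a read at every exact occurrence of the assumed junction."""
    return JUNCTION.split(read)

# Toy concatemer built from two made-up inserts.
insert_1 = PRIMER_BODY + "GGGTTTCCC" + reverse_complement(PRIMER_BODY)
insert_2 = PRIMER_BODY + "ACGTACGTAAA" + reverse_complement(PRIMER_BODY)
fragments = split_concatemer(insert_1 + SITE + insert_2)
print(SITE, SITE_IS_PALINDROME, len(fragments))
```

Requiring the primer flanks around the site avoids splitting at chance occurrences of the same six bases inside a cDNA insert.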

Preparing PacBio SMRTbell templates and conducting sequencing

The ligated cDNA PCR products were subjected to end repair and A-tailing using the NEBNext End Repair/dA-Tailing Module. Following this, the products were ligated with a dT-overhang hairpin adapter via the NEBNext® Ultra™ II Ligation Module (NEB) and purified with 0.4 volumes of Agencourt AMPure XP Beads (Beckman Coulter) to create the SMRTbell template. To remove residual adapters and unligated DNA fragments, the library was treated with 1 μL exonuclease I (NEB), 1 μL exonuclease III (NEB), and NEBuffer 1 (NEB), followed by incubation at 37 °C for 1 h. The purified products were then processed with 0.8 volumes of Agencourt AMPure XP beads, eluted in 15 μL elution buffer (10 mM Tris-HCl, pH 8.0), and quantified using Agilent DNA 12000 Kits (Agilent). Sequencing primer annealing and polymerase binding to the PacBio SMRTbell templates were conducted following the manufacturer’s protocols (PacBio, US). Finally, the library complex was sequenced using the SMRT Cell 8 M (PacBio) compatible with the Sequel II sequencer.

Validation of Isoform Expression by RT-qPCR

RNA extraction and cDNA synthesis: Total RNA was isolated from each sample using the QIAGEN RNeasy Mini Kit (#74104) following the manufacturer’s instructions. Next, 50 ng of total RNA was reverse transcribed into cDNA using SuperScript™ II Reverse Transcriptase (Invitrogen) and oligo dT primers. In brief, a 12 µL mixture containing oligo dT primers, dNTPs, and total RNA was heated to 65 °C for 5 minutes and then quickly cooled on ice. Afterward, 5× first-strand buffer, recombinant RNase inhibitor, 0.1 M DTT, and SuperScript™ II RT were added, bringing the final volume to 20 µL. The reverse transcription reaction was performed at 42°C for 50 minutes and then inactivated at 70°C for 15 minutes. The resulting cDNAs were then diluted to a final volume of 60 μL with nuclease-free water.

RT-qPCR analysis: The RT-qPCR experiments were performed on 96-well plates (Axygen) using the StepOnePlus system (Applied Biosystems). Primer sequences and additional details are provided in Supplementary Table S2. Each reaction mixture comprised 5 μL of TB Green Premix Ex Taq II (Tli RNase H Plus) (Takara, #RR82WR), 1 μL of 5 μM primer mix, 1 μL of diluted cDNA, and 3 μL of nuclease-free water. The thermal cycling conditions were set as follows: an initial denaturation at 95°C for 1 minute, followed by 40 cycles at 95°C for 10 seconds, 52°C for 30 seconds, and 68°C for 30 seconds. All RT-qPCR assays were performed in triplicate, with three biological replicates for each sample.

Short-read RNA-seq data analysis

For each sample, we first aligned the clean reads to the Lactuca sativa L. genome and transcriptome (NCBI Lsat_Salinas_v7) using STAR (version 2.7.2a)32 with the parameter “--quantMode GeneCounts”. We then used featureCounts (version 2.0.1)33 to obtain the read count for each gene, with the parameters “-p -t exon -g gene_id”. The read counts were subsequently normalized to CPM (counts per million) and TPM (transcripts per million) values.
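As an illustration of the normalization step, the sketch below computes CPM and TPM for a single sample from a table of gene read counts. The function names, the toy counts, and the gene lengths are assumptions made for this example; the text does not state which tool performed the normalization or how gene lengths were defined.

```python
def cpm(counts):
    """Counts per million for one sample.

    counts: dict mapping gene_id -> read count.
    """
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no counted reads")
    return {gene: c / total * 1e6 for gene, c in counts.items()}


def tpm(counts, lengths_bp):
    """Transcripts per million for one sample.

    Counts are first divided by gene length in kilobases, then the
    length-normalized rates are rescaled so that they sum to one million.
    lengths_bp: dict mapping gene_id -> length in bp (hypothetical values here).
    """
    rates = {gene: c / (lengths_bp[gene] / 1000.0) for gene, c in counts.items()}
    total = sum(rates.values())
    if total == 0:
        raise ValueError("sample has no counted reads")
    return {gene: r / total * 1e6 for gene, r in rates.items()}


# Toy data for illustration only.
example_counts = {"geneA": 100, "geneB": 300, "geneC": 600}
example_lengths = {"geneA": 1000, "geneB": 2000, "geneC": 3000}
example_cpm = cpm(example_counts)
example_tpm = tpm(example_counts, example_lengths)
```

Because TPM divides by gene length before rescaling, the TPM values within a sample always sum to one million, whereas CPM reflects only sequencing depth.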

Generation of CCS reads

CCS reads were produced using SMRT-Link (version 10.1.0.119588) with adjusted parameters: “--min-passes 0 --min-length 50 --max-length 21000 --min-rq 0.75”. Subsequently, the resulting CCS reads were demultiplexed using the Lima program included in SMRT-Link (version 10.1.0.119588) by grouping sample barcodes.

Generation of FLNC reads

FLNC reads were generated in three steps. (1) Mapping primers to CCS reads: the 5’ and 3’ primers were mapped to the CCS reads using NCBI BLAST (version 2.10.0+)34 with the parameters “-outfmt 7 -word_size 5”. (2) Extracting UMIs: the primer BLAST results were used as input for the classify_by_primer utility to extract unique molecular identifiers (UMIs). (3) Generating FLNC reads: the final FLNC reads were generated using the parameters “-umilen 8 -min_primerlen 16 -min_isolen 50”. The classify_by_primer utility performed the following functions: (i) parsed the 5’ and 3’ primers in the CCS reads to obtain FLNC reads, orienting them from the 5’ to the 3’ end; (ii) trimmed the 5’ and 3’ primer sequences, along with the 8 bp sequences following the 3’ primers as UMIs; and (iii) trimmed the 3’ polyA tail using a sliding window algorithm. The program ensured that each read was correctly oriented by strictly pairing the 5’ and 3’ primers. Reads that included primers, UMIs, and polyA tails were considered FLNC reads.
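The sliding-window trimming of the 3’ polyA tail can be sketched as follows. This is a minimal illustration and not the classify_by_primer implementation: the function name, the window size, and the minimum adenine fraction are all assumptions, and the read is assumed to be already oriented 5’ to 3’ with primers and UMI removed.

```python
def trim_polya(seq, window=10, min_a_frac=0.8):
    """Trim a 3' polyA tail with a sliding window (illustrative sketch).

    Returns (trimmed_sequence, tail_length). The window and threshold
    values are hypothetical, chosen only for this example.
    """
    seq = seq.upper()
    end = len(seq)
    cut = end  # index where the detected tail starts; end means "no tail"

    # Slide the window from the 3' end toward the 5' end, one base at a
    # time, for as long as the window remains A-rich.
    start = end - window
    while start >= 0:
        win = seq[start:start + window]
        if win.count("A") / window < min_a_frac:
            break
        cut = start
        start -= 1

    # The last accepted window may begin with a few non-A bases that
    # belong to the transcript; move the cut forward to the first A.
    while cut < end and seq[cut] != "A":
        cut += 1

    return seq[:cut], end - cut


trimmed, tail_len = trim_polya("GATTACAGGC" + "A" * 15)
```

Allowing a window to contain a minority of non-A bases makes the trimming tolerant of occasional sequencing errors inside the tail, at the cost of possibly removing a genuine A-rich transcript end.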

Genome alignment of FLNC reads

After identifying and trimming the primers, UMIs, and polyA tails from the FLNC reads, the remaining sequences were aligned to the Lactuca sativa L. genome (NCBI Lsat_Salinas_v7). This alignment was performed using minimap2 (version 2.20-r1061)35 in spliced alignment mode, employing the parameters: “-ax splice -uf --secondary=no”.

Generation of the gene count matrix

Following the alignment of FLNC reads to the genome, gffcompare (version 0.11.6)36 was used to assign the FLNC reads to NCBI Lactuca sativa L. annotation gene models (NCBI Lsat_Salinas_v7). The gene_quantify utility was then used to produce gene expression data for each sample from the gffcompare output and the UMI associated with each FLNC read.
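One plausible way to combine gene assignments with UMIs is to count the distinct UMIs observed per gene in each sample, so that reads sharing a UMI are counted once. The sketch below makes that assumption explicitly; the internal logic of gene_quantify is not described here, and the function name and input layout are hypothetical.

```python
from collections import defaultdict


def gene_umi_counts(assignments):
    """Count distinct UMIs per gene per sample (assumed quantification rule).

    assignments: iterable of (sample, gene_id, umi) tuples, one per FLNC
    read that was assigned to a gene model.
    Returns {sample: {gene_id: number_of_distinct_UMIs}}.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for sample, gene_id, umi in assignments:
        seen[sample][gene_id].add(umi)
    return {
        sample: {gene_id: len(umis) for gene_id, umis in genes.items()}
        for sample, genes in seen.items()
    }


# Toy reads for illustration only; the second read duplicates the first UMI.
example_reads = [
    ("leaf", "gene1", "AAAACCCC"),
    ("leaf", "gene1", "AAAACCCC"),
    ("leaf", "gene1", "GGGGTTTT"),
    ("leaf", "gene2", "ACGTACGT"),
    ("root", "gene1", "AAAACCCC"),
]
example_matrix = gene_umi_counts(example_reads)
```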

Collapsing redundant isoforms

Redundant isoforms were collapsed using the Python script “collapse_isoforms_by_sam.py” from the cDNA_Cupcake package (https://github.com/Magdoll/cDNA_Cupcake). The default values for minimum alignment coverage (--min-coverage) and minimum alignment identity (--min-identity), 0.99 and 0.95, respectively, were used. This filtering helps ensure that the collapsed transcripts are highly accurate.

Nonredundant isoform quality filtering, coding frame prediction, and UTR detection

To analyze and ensure the quality of nonredundant isoforms, SQANTI337 was employed, utilizing NCBI Lactuca sativa L. annotation gene models (NCBI Lsat_Salinas_v7). Isoforms were categorized as either known or novel. For the prediction of CDS coding frames and the identification of UTRs in nonredundant isoforms, SQANTI3 called upon GeneMarkS-T (version 5.1, March 2014).

Updating reference annotation with Iso-Seq isoforms using StringTie

After the collapsing and SQANTI3 quality filtering procedure, the HIT-ISOseq isoforms were integrated with the NCBI Lactuca sativa L. annotation gene models (NCBI Lsat_Salinas_v7 and Lsat_Salinas_v11) using StringTie (version 2.2.1)38 with the following parameters: “--merge input.gff -G ref.gff”. The output of this process was a GTF file containing the updated gene models.

BUSCO assessments for transcript annotations

To evaluate the quality of transcript annotation for both the reference and updated gene models of NCBI Lsat_Salinas_v7 and Lsat_Salinas_v11, we conducted a BUSCO analysis (version 5.4.7)39. This was done using the BUSCO transcriptome mode (-m transcriptome) with OrthoDB v10, enabling auto lineage selection (--auto-lineage-euk). The results were summarized and visualized using the ggplot2 package in R.

Functional annotation analysis

We performed annotation on the HIT-ISOseq updated transcripts using seven functional databases. Transcripts were annotated with the Pfam (v14.6)40 library using hmmscan (v3.1b2)41. Blastn (v2.2.23)34 was used to annotate transcripts with the NT database (downloaded on November 17, 2021). Diamond (v0.8.31)42 was employed to annotate transcripts with NR (downloaded on November 17, 2021), KOG (v20090331)43, KEGG (v101)44, and Swiss-Prot (release-2020_02)45. Blast2GO (v2.5.0)46 and NR annotation results were utilized for GO47 annotation.

Generation of the isoform count matrix

Following the collapsing and SQANTI3 quality filtering processes, we employed the isoform_quantify utility to determine the expression levels of isoforms in each sample. Isoforms detected with fewer than 5 UMIs across all samples were subsequently filtered out.
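The low-support filter can be sketched as below. The sketch reads “fewer than 5 UMIs across all samples” as the UMI total summed over samples, which is an interpretation; the function name and the input layout are likewise assumptions for illustration.

```python
def filter_low_support(isoform_umis, min_total=5):
    """Drop isoforms whose summed UMI count across samples is below min_total.

    isoform_umis: {isoform_id: {sample: umi_count}}.
    Returns a new dict containing only the retained isoforms.
    """
    return {
        isoform_id: per_sample
        for isoform_id, per_sample in isoform_umis.items()
        if sum(per_sample.values()) >= min_total
    }


# Toy counts for illustration only.
example_isoforms = {
    "iso1": {"leaf": 3, "root": 2, "stem": 0},  # total 5, kept
    "iso2": {"leaf": 1, "root": 1, "stem": 2},  # total 4, removed
    "iso3": {"leaf": 0, "root": 9, "stem": 0},  # total 9, kept
}
kept = filter_low_support(example_isoforms)
```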

Differential expression analysis of genes and isoforms

To identify tissue-specific genes and isoforms, we used the edgeR R package (version 3.32.1)48, applying thresholds of logFC > 2 and p-value < 0.001. Enrichment analysis of these tissue-specific genes and isoforms, based on GO and KEGG annotations, was conducted with the clusterProfiler R package (version 3.18.1)49, referencing our updated functional annotation file.

Mfuzz cluster analysis

The CPM values of the tissue-specific isoforms and their corresponding genes in each sample were merged into a single expression matrix. Subsequently, we employed the Mfuzz R package (version 2.50.0)50 to group the expressed features into three clusters based on the three tissues (leaf, root, and stem). Based on the clustering results, we identified isoforms whose cluster assignment differed from that of their corresponding gene.
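Once each feature has a cluster label, the comparison between isoforms and their parent genes reduces to a lookup. The sketch below assumes each feature has already been given a single (hard) cluster label, for example its highest-membership Mfuzz cluster; the function name, input dictionaries, and identifiers are hypothetical.

```python
def inconsistent_pairs(isoform_cluster, gene_cluster, isoform_to_gene):
    """List isoforms assigned to a different cluster than their parent gene.

    isoform_cluster: {isoform_id: cluster_label}
    gene_cluster:    {gene_id: cluster_label}
    isoform_to_gene: {isoform_id: gene_id}
    Returns a sorted list of (isoform_id, gene_id, isoform_label, gene_label).
    Pairs with a missing label on either side are skipped.
    """
    pairs = []
    for isoform_id, gene_id in sorted(isoform_to_gene.items()):
        iso_label = isoform_cluster.get(isoform_id)
        gene_label = gene_cluster.get(gene_id)
        if iso_label is None or gene_label is None:
            continue
        if iso_label != gene_label:
            pairs.append((isoform_id, gene_id, iso_label, gene_label))
    return pairs


# Toy cluster labels for illustration only.
example_pairs = inconsistent_pairs(
    isoform_cluster={"iso1": 1, "iso2": 3, "iso3": 2},
    gene_cluster={"geneA": 1, "geneB": 2},
    isoform_to_gene={"iso1": "geneA", "iso2": "geneA", "iso3": "geneB"},
)
```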

Generation of the isoforms structure view

Isoforms of interest were selected and imported as transcript annotation files into the ggtranscript R package (version 0.99.3)51 to visualize their splicing structures.

Statistics and reproducibility

This study collected two samples from Lactuca sativa L., creating two replicate samples each of roots, leaves, and stems for long-read RNA sequencing. No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. All statistical details for the long-read RNA sequencing analysis can be found in the figure legends and in the Methods section.