Introduction

Vaccinia virus (VACV) the prototypic member of poxviruses, contains a large, double-stranded DNA molecule encoding more than 200 protein-coding genes.

Long-read sequencing (LRS) has now become a mainstream approach in transcriptome studies. LRS is superior over short-read sequencing (SRS) in several respects, including their capability to detect full-length transcripts, which allows the efficient identification of long multicistronic RNA molecules, alternatively transcribed and processed transcripts, and transcriptional overlaps. Currently, two LRS approaches are available from Pacific Biosciences (PacBio) and from Oxford Nanopore Technologies (ONT). Our research group applied both approaches for the study of the transcriptomes of various viruses, including herpesviruses [1,2,3,4,5,6,7], baculoviruses, [8], circoviruses [9], retroviruses [10], asfariviruses [11], and poxviruses [12]. These investigations identified a large number of novel transcripts, transcripts isoforms, polycistronic RNA molecules, and transcriptional overlaps.

The effect of viral infection on the host cell transcriptome has been studied in various organisms using SRS [13] and LRS [14]. In this study, we report the time-course analysis of the effect of VACV infection on the transcriptome of African green monkey cells. This work demonstrates the general utility of LRS for the quantitative temporal analysis of gene expression.

Main text

Materials and methods

Cells, viruses, infection

African green monkey (Chlorocebus sabaeus) kidney fibroblast cells (CV-1; from ATCC, USA) were used for the propagation of the Western Reserve strain of VACV. Cells were incubated at 37 °C for 1, 2, 3, 4, 6, 8, and 12 h in a humidified atmosphere containing 5% CO2. Following incubation, media were removed and the cells were rinsed with serum-free RPMI 1640 medium and were subjected to three cycles of freeze–thawing.

RNA purification

Total RNA was extracted from the viral-infected cells using Macherey–Nagel RNA kit (Düren, Germany) according to the manufacturer’s recommendations. Polyadenylated fractions were isolated from total RNAs using Oligotex mRNA Mini Kit (Qiagen, Hilden, Germany) following the Spin-Column Protocol [1,2,3,4,5,6,7,8,9,10,11,12].

Library preparation

The libraries for sequencing were generated from polyA(+) RNA fractions. For PacBio libraries, the SMARTer PCR cDNA Synthesis Kit (Clontech, Mountain View, California, United States) and the “PacBio Isoform Sequencing (Iso-Seq) using Clontech SMARTer PCR cDNA Synthesis Kit and No Size Selection protocol” were used. The ONT 1D strand-switching cDNA ligation protocol (SSE_9011_v108_revS_18Oct2016, Pacific Biosciences, Menlo Park, California, United States) was used to produce libraries for nanopore sequencing. The details were described in our earlier publications [12, 15, 16].

Data processing

Read of inserts (ROIs) from the Sequel raw reads were generated using SMRT Link5.0.1.9585. MinION base calling was performed using the Albacore software v.2.0.1 (Oxford Nanopore Technologies, Oxford, UK). The ONT’s Guppy v.3.6.0 (Oxford Nanopore Technologies, Oxford, UK) was also used for basecalling with the aim to validate the data.

Transcriptome profiling of CV-1 cells

The C. sabeus reference genome (GCF_000409795.2) was used for aligning of MinION reads. We excluded MAPQ = 0 and secondary and supplementary alignments from downstream analyses. Primary alignments mapping to the VACV genome were counted separately. Reads that were aligned to the host genome were associated with host genes according to the C. sabeus_1.1_top_level.gff3 genome coordinates. Only reads that matched the exon structure of reference genes (using a ± 5-bp window for matching exon start and end positions) were counted. Gene counts were normalized to total valid read counts that were mapped to the host genome to identify abundant housekeeping genes that were not influenced by virus–host interactions during the experiment. Geometric means were calculated for the top 16 housekeeping genes (Table 1). A coefficient of variation of < 0.2 was used to normalize gene counts. Our criteria for statistical analyses of gene expression included > 60 normalized total gene counts for the six-time-point data. To classify genes by their expression profiles in the examined time series, we transformed normalized gene counts to a relative scale where the highest expression time point had a value of 1.0 (100%). We performed clustering using the k-means algorithm of the basic statistics package of R (v.3.5) (https://cran.r-project.org/bin/windows/base/old/3.5.0/). The optimal number of clusters was determined using Calinski-Harabasz criteria [17] with the cascade KM algorithm of the vegan (v.2.5-4) R package (Additional file 1: Fig. S1). Clusters were visualized using the heatmap (v.1.0.2) R package. Because few host gene reads failed to enable characterization of host gene expression at the later time points (8–12 h p.i.), host gene expression was only analyzed until 6 h p.i. We clustered the genes into five subcategories according to normalized gene expression profiles. We identified four categories for which gene expression profiles changed during the experiment (Additional file 1: Fig. S2). However, because clustering forces genes into categories, we identified the subset of genes with the most typical expression profiles for each clusters. Thus, in each cluster we plotted mean expression levels for different sampling times using ggplot2 (v.3.1.0, stat_smooth algorithm using LOESS method).

Table 1 List of the highly expressed housekeeping genes

The scores for each gene were calculated on the basis of expression profiles to represent the alteration of expression level between sampling points that showed the greatest difference in every cluster. Based on this scores, we identified the most characteristic genes in all clusters falling within the range between the top score and the top score–1 SD (Additional file 5: Table S1). Using the identified subset of genes, we performed overrepresentation analyses for the most characteristic genes in each cluster using 768 highly expressed genes as references with the PANTHER (v.14.1 using the 2018_04 dataset release) [18] software tool. We analyzed Gene Ontology biological processes with false discovery rates of < 1.

Results

In this work, the time-varying transcriptome of CV-1 cells were profiled applying LRS datasets. We used Albacore for basecalling. The Guppy basecaller was also run: we found almost perfect match between the results generated by the two toolkits [15]. The obtained datasets were used to determine the VACV transcripts [15] using the LoRTIA pipeline that was developed by our group [19]. Herein, we used a workflow and a pipeline for transcriptome profiling of LRS datasets that were also developed in our laboratory [4, 6, 20].

MinION sequencing yielded 964,775 host reads (average mapped read length: 583 nts). Sequel sequencing generated 439,330 host ROIs (average mapped read length: 1368 nts). The PacBio MagBead loading protocol selects fragments of less than 1 kb [21]. Dynamic gene expression profiles were classified by transforming normalized gene counts to a relative scale where the highest expression time point had a value of 1.0 (100%).

‘Static’ host transcriptome

Using the LoRTIA toolkit, we annotated a total of 478 transcription start sites (TSSs), 2011 transcription end sites (TESs), and 24,574 splice junctions, each represented by at least 10 reads (Additional files 6, 7: Tables S2, S3). Analyses of sequence regions upstream of TSSs revealed 43 canonical CAATT boxes (mean distance: 104.913 nt; standard deviation (sd): 15.306), 880 canonical GC boxes (mean distance: 60.095 nt; sd: 33.374), and 80 canonical TATA boxes (mean distance: 31.13 nt; sd: 2.966). It was demonstrated that initiator elements surrounding TSSs of human genes are more likely to be present if the transcript lacks a TATA box upstream of its TSS [22]. In our analyses, the BBCABW (B = C/G/T, W = A/T) initiator consensus was more pronounced around TSSs of TATA-less genes than around those with TATA boxes upstream of their TSSs (Fig. 1).

Fig. 1
figure 1

The distance of the host promoter elements from the TSS or the transcription initiator sequence. a–c Letter ‘M’ following the sample name indicates MinION sequencing, whereas letter ‘S’ indicates Sequel sequencing. The horizontal lines in the box plots represent the median distance of the given sample. d The transcriptional initiator region carries a BBCABW (B = C/G/T, W = A/T) initiator consensus sequence when no TATA box is present upstream of the TSS, whereas the consensus is missing in TSSs with a TATA box

Specifically, 1,849 TESs contain canonical upstream poly(A) signals (average: 26.681 nt; sd: 9.340), whereas ± 50-nt regions contain U-rich upstream and G/U-rich downstream elements (Additional file 2: Fig. S2). A total of 12,287 introns were annotated: 12,215 of these contained canonical GT/AG, 65 had GC/AG, 7 had AT/AC splice junctions. Those LoRTIA transcripts were accepted that were identified from at least two techniques and in three samples. The average transcript length was 693.661 nt (sd: 962.62), no significant deviations were observed during infection. Lengths of 5’-UTRs deviated around a mean of 52.956 nt (sd: 75.903), whereas the 3’-UTRs had a mean length of 295.219 (sd: 431.480) (Additional file 3: Fig. S3). Transcripts represented by less than ten reads were excluded. These assessments revealed a total of 758 transcript isoforms, 207 with TSSs and TESs in ± 10-nt intervals of previously annotated transcripts, 692 length isoforms, and 66 alternatively spliced isoforms. In total, 239 mRNA length isoforms differed from previously annotated transcripts in their TSS positions [including those with TSSs downstream of respective translation initiation sites (TISs)], 19 in their TES positions, and 56 in both. Altogether 31 transcript isoforms were found with TSSs downstream of previously annotated TISs, and contained curtailed forms of open reading frames (ORFs). These RNAs might code for N-terminally truncated forms of canonical proteins. A total of 177 transcripts were annotated as non-coding. The present non-coding RNAs include either isoforms of previously annotated ribosomal RNAs, or are truncated mRNAs lacking ORFs (Additional file 8: Table S4.)

Temporal response of the host transcription to VACV infection

The analysis of VACV infection on host transcription revealed relatively few differentially expressed genes [23, 24]. A recent proteomic study also showed that VACV infection affects very few host genes [25]. We categorized five distinct clusters of 768 highly expressed host genes with respect to their responses to viral infection. Among early up genes, no or very low expression levels were observed before virus infection, but consistently high expression was observed at all later sampling points. Early down transcripts were highly expressed before virus expression and were then constantly absent or had low expression levels at later time points. Early up/down transcripts had no or low expression before virus infection, high expression from 1 h p.i. and no or low expression at later sampling points. Mid up transcripts had no expression before virus infection and peaked and plateaued at 2 or 3 h p.i. Constant transcripts had no significant changes in relative expression levels over the course of our experiments (Fig. 2). We assessed expression patterns of the best characterized gene clusters using GO (Additional file 9: Table S5), and found significant overrepresentation of the GO process “regulation of signaling receptor activity” by genes that were highly expressed in the early stages but were not or only slightly expressed in the late infection phases. Most of the present clusters were not significantly enriched in genes of specific biological processes, although many genes that were upregulated during viral infection were found to play roles in cell division or in “positive regulation of viral life cycle”. Moreover, some genes that were downregulated upon viral infection were annotated to “cell growth” and “mesenchymal differentiation” categories.

Fig. 2
figure 2

Expression changes of the highly abundant host genes during viral infection. a The expression pattern of the host gene clusters. b Heatmap representation of the five distinct host gene clusters

Discussion

We employed LRS platforms for the kinetic analysis of African green monkey transcripts upon VACV infection. Sequencing data were annotated using the LoRTIA pipeline. To avoid annotation of non-specific reads as transcripts, we applied additional filtering of the obtained data using stringent criteria [15], but a certain fraction of excluded reads may represent true transcripts. Our analyses revealed 758 host transcript isoforms. Although partially degraded RNAs are captured by LRS [26], we observed relatively constant mean mapped read lengths from host RNA before and during viral infection. We identified a subset of genes with distinct gene expression profiles during viral infection. Although, only a single class of genes showed significant over representation in the „regulation of signaling receptor activity” GO biological process, we note that this analysis was limited to the study of the most abundantly expressed 768 genes.

In sum, this work identified a large number of host RNAs and transcript isoforms, and revealed a significant overrepresentation of genes in the “regulation of signaling receptor activity” GO annotation in virally infected cells. This study also demonstrates the value of LRS in time-course characterization of the transcriptomes in any organism.

Limitations

While this work could not cover the full gene sets of the host’s molecular pathways, we think that the expression changes of the identified genes nevertheless are likely the result of virus infection or related to the host’s response to infection. The limitation of this study is the relative low data coverage obtained by both sequencing methods.