Introduction

SARS-CoV-2 is the cause of coronavirus disease 2019 (COVID-19). There have been over 256 million COVID-19 cases and over 5.1 million total deaths due to COVID-19 worldwide, at time of writing1. Genomic analyses of longitudinal specimens within infected persons are critical to understanding the evolutionary trajectory of SARS-CoV-2. Sequencing of longitudinal samples from infected individuals allows examination of viral genetic diversity, host immune response, and dynamics of co-infecting pathogens over the course of infection and recovery. Within-host variants arise during viral replication and a number of processes shape their frequencies over time. These include selective pressures at different scales (molecular, immunological, epidemiological), host heterogeneity, spatial structure, population bottlenecks, and other stochastic processes2. Within-host variants may impact the success of vaccines and therapeutics, and a fraction of variants that arise will be transmitted between hosts and can eventually reach fixation in the population3. Recent studies of within-host diversity of SARS-CoV-2 have demonstrated the presence of low levels of minor variants and infrequent emergence of escape mutations3,4,5,6,7. Of particular note, deletions in the N-terminal domain of the spike glycoprotein have been observed in chronically infected immunocompromised patients that are associated with SARS-CoV-2 escape from sera8,9,10,11, and are present in current circulating lineages of concern.

Here we examined longitudinal clinical specimens collected from 20 COVID-19-positive patients in Washington State. With metagenomic sequencing we identified changes in host gene expression and bacterial co-occurrences, which may be associated with recovery12,13. We found negligible change in viral consensus sequences over time, but detectable changes in variant allele frequencies that are only weakly predictive of future consensus changes across the globe. We further observed rapid emergence of deletion variants in the N-terminus domain of the spike glycoprotein in one patient, potentially suggesting within-host SARS-CoV-2 evasion of NTD-directed antibodies. Taken together our results support the limited emergence and fixation of escape variants during a typical infection, and also highlight the need to monitor minor variants due to their potential impact on vaccine and therapeutic efficacy.

Results

Viral load dynamics in longitudinal samples from SARS-CoV-2 infected individuals

Residual clinical specimens were obtained from the University of Washington (UW) Virology Lab after testing for SARS-CoV-214. By reviewing our laboratory information system, we identified 20 individuals who had two or more positive or inconclusive samples collected between March and September 2020 (Table 1, Supplementary Table S1). Inconclusive samples had one of two PCR targets detected. This commonly occurs in samples with small amounts of viral DNA, and thus these were considered presumptive positive for SARS-CoV-2, albeit with low viral load. A majority of samples came from inpatients who received care within the UW Medicine system, which includes UW Medical Center, Harborview Medical Center, and Northwest Hospital. Samples came from individuals with a mean age of 70 (range 42–99), many with severe disease given the availability of multiple samples from these patients. Consistent with other reports15, we observed that viral load declined over time in most patients with two or more positive or inconclusive samples with an average increase in RT-QPCR cycle threshold (Ct) of 0.87 per day (Fig. 1, Supplementary Fig. S1).

Table 1 Demographics and clinical characteristics of patients included in study.
Figure 1
figure 1

Viral load dynamics in sequenced samples. Dots represent a unique sequenced sample. Lines connect samples from a single patient. Same day samples are not shown (see Supplementary Fig. S1).

A total of 49 samples (47 nasopharyngeal and 2 oropharyngeal swabs) were sequenced with sufficient reads to be included in this study (Supplementary Fig. S2A). Ct values for these samples ranged between 14.6 and 36.6. The length of time between collection dates for sequenced samples from the same individual ranged from 0 to 22 days. Samples collected on the same date were sequenced for three individuals. We sequenced nasopharyngeal and oropharyngeal samples from P004 collected at the time of autopsy, two samples from P006 collected at the same time during a hospital admission, and samples collected 9 h apart from P012 during an emergency room visit.

Negligible change in consensus sequences over time

We obtained full-length viral genome sequences with less than 2% unknown bases (Ns) for two or more time points in 14 out of 20 individuals, plus an additional two individuals with paired nasopharyngeal and oropharyngeal swabs collected at the same timepoint (Table 2, n = 38 sequences). After masking ambiguous sites and regions with sequencing or assembly errors, we found no differences between the first and subsequent consensus sequences in 15 out of 16 patients. In one patient (P001), two samples collected 3 days apart had 4 differences between their consensus sequences at reference positions 15,418 (G/T), 26,262 (G/T), 27,899 (T/A), and 27,944 (T/C). Two of these differences lead to coding changes (A660S in nsp12 and Q2K in ORF8). Variant alleles were observed for all four positions at low frequencies in the sample collected at the earlier timepoint. These mutations are not present in any currently characterized variants of concern.

Table 2 Consensus sequence analysis of SARS-CoV-2 in longitudinal specimens.

Low frequency variants detected across the genome

We analyzed intrahost viral genetic variation by examining all sites with > 100 × locus depth, masking known problematic sites and filtering for high-confidence variants (see Methods). We examined sites in 47 samples from 20 different patients and found a total of 103 unique non-synonymous variants relative to the Wuhan-Hu-1 (NC_045512.2) reference genome present at frequencies between 5 and 95% (Fig. 2A). nsp12 had the highest number of variant sites (26, 25.24%), followed by nsp3 (16, 15.533%) and the spike glycoprotein (13, 12.62%). Even when adjusted for gene length, mutations were most prevalent in nsp12, at 0.009 variant sites per nucleotide. For samples that were sequenced multiple times (n = 10), variant frequencies were reproducible across replicates, particularly among samples sequenced with the same library preparation method (Fig. 2B). Intra-host mutations present in multiple longitudinal samples of the same patient were mostly fixed and highly clonal, with some low-level variants (Supplementary Fig. S2B).

Figure 2
figure 2

Low frequency variation is abundant but only a small number of variants exhibit a significant change in allele frequency over the course of infection. (A) Each dot represents a high-confidence coding change in a single sample relative to the Wuhan-Hu-1 (NC_045512.2) reference genome with variant allele frequency between 5 and 95%, at least ×100 coverage at the site, and reproducibility in multiple samples at lower frequencies (< 40%). Color scale represents the change in allele frequency across time points in the same patient with darker colors representing variants that had greater changes in frequency across samples. Small dark grey marks along the top margin shows positions with variant frequencies > 95% (fixed mutations relative to the reference). Size of circles indicates sequencing depth at the site. Marginal histogram shows distribution of variants using bin width of 500 nucleotides. (B) Comparison of allele frequencies of low-frequency variants (< 20%) across replicates of the same sample (n = 10 samples). Each dot represents a variant with ≥ 100 total depth and ≥ 10 allelic depth in each replicate. Line of best fit is shown in purple, and dots in orange represent replicates that were re-sequenced using a different library preparation method (amplicon sequencing vs. shotgun metagenomic sequencing).

Of the seven most commonly observed variants in our dataset (Table 3), the three most frequent define the Washington state outbreak clade16,17 and the rest of the variants are clade-defining mutations in Nextstrain clades 20A and 20C. Nine out of 20 patients had the spike protein mutation D614G (A23403G), which has been associated with increased transmissibility and higher viral loads18,19. While this variant was rare at the beginning of the pandemic, it reached near fixation in the global SARS-CoV-2 population by June 202020. This rapid rise in prevalence is reflected in our data, as this D614G mutation is present in all three patients with samples collected during or after June 2020. In 10 out of the 11 patients with the 614D variant, no alternate alleles were detected at this position. In the second sample from P007, 614G was detected with a variant allele frequency of 6.1%, but the read depth at this locus (82X) was insufficient to reach our QC standards.

Table 3 Frequent non-synonymous variants observed in ≥ 15 samples (n = 47).

Variants exhibiting intra-host evolution are limited in prediction of future global consensus changes and highlight SARS-CoV-2 antibody evasion

We further examined variants that underwent a maximum allele frequency change of ≥ 20% across timepoints within each patient. The derived alleles for all 25 non-synonymous amino acid changes meeting this criteria in our dataset were also observed among consensus genomes deposited in GISAID21 by April 2021 (range: 1–2171, mean: 467.8, median: 86.5). Only 8 variants exhibited a maximum allele frequency change of ≥ 40%. The derived alleles for these variants were present in very few GISAID consensus sequences (range: 1–1751), which at the time of analysis represented a mere 0.0001–0.17% of all sequences deposited in GISAID (Fig. 3A). Though the number of consensus sequences with these derived alleles was relatively low, these sequences were diverse with respect to collection date and geographic origin (Fig. 3B).

Figure 3
figure 3

Variants that exhibit ≥ 40% maximum change in allele frequency in the individuals profiled here in summer 2020 show limited ability to predict future GISAID consensus sequences as of April 2021. (A) Relative frequencies of the derived allele found in GISAID consensuses across the genome. Dots represent each unique variant with size indicating the maximum intra-host change in allele frequency found in our study. (B) Number of GISAID consensuses with the derived allele for each variant. Height of vertical bars represents the total number of consensuses with the derived allele collected for each month from March 2020 to March 2021 and bar color represents the number of continents of origin for these consensuses.

Three variants with a ≥ 20% within-host change in allele frequency localized to the spike glycoprotein (Fig. 4A). None of these had an allele frequency change of ≥ 50%. In patient P016, a S:143–145 6-nucleotide deletion was observed at an allele frequency of 20.7% in patient sample 2, but was not observed in samples collected 3 days prior and 3 days later. This patient was immunocompetent and not receiving any COVID-19 treatment. Interestingly, numerous other deletions arose at low frequencies in this patient, with the largest number present at day 0 (Fig. 4B). The most prevalent deletion variant in the day 0 sample (collected 16 days after initial symptom onset) was S:∆141–144 at 1.87% allele frequency. This deletion was the second most common variant in the day 3 sample at 1.2% allele frequency, but was not observed at all in the day 6 sample. Deletions in this region, including S:∆141–144, have previously been observed in chronically infected immunocompromised patients, and some are associated with escape from NTD-specific neutralizing antibodies or polyclonal sera8,9,10,11,22. Notably, a S:∆142–144 deletion is a hallmark of the currently circulating Omicron variant.

Figure 4
figure 4

Variants that exhibit intra-host evolution in the spike protein across all patients. (A) All non-synonymous variants located in the spike protein with a ≥ 20% change in allele frequency among timepoints for any patient. (B) Enumeration of deletions that arose between residues 138–149 of the spike protein in P016 at ≥ 1% relative frequency reveals a rapidly changing complement of low frequency alleles present over a 6-day period. The reference nucleotide sequence (NC_045512) is located at the top of the sequence alignment. Above the reference is the corresponding amino acid sequence with associated residue numbers. Alleles that match the reference are in gray, and deletions are shown in black. To the right of the sequence alignment is a bar graph showing the square root of the relative frequency of each variant, for visualization purposes, labeled with the allele frequency in percentage and read depth in parentheses.

Longitudinal RNAseq analysis illustrates loss of ciliated epithelium during infection

For samples that were sequenced metagenomically, we pseudo-aligned reads to the human transcriptome to perform differential expression analysis comparing initial (t = 0) timepoints to later timepoints. Samples with more than 900,000 pseudo-aligned reads (n = 7 initial, 3 later timepoints) were included in the analysis to determine variation in host gene expression over time. We observed a dramatic downregulation of several cytoskeletal genes, particularly dynein heavy chain (DNAH 2, 3, 5, 6, 7, 9, 10, 11, 12), as well as WDRs, MAP1A, and others (Fig. 5A, Supplementary Table S3). Gene Ontology analysis (Fig. 5B) confirmed that downregulated genes are involved in biological processes associated with microtubule-based motility. This is consistent with the death of ciliated epithelial cells, which are enriched for transcripts encoding microtubule transport machinery23, following SARS-CoV-2 infection. We observed upregulation of some actin cytoskeleton-related transcripts like VAV1, VASP, and RhoF24. We also observed downregulation of several interferon-stimulated genes, but these did not reach statistical significance in this small sample set.

Figure 5
figure 5

Differentially expressed genes during SARS-CoV-2 infection. (A) Twenty differentially expressed genes with lowest adjusted p value. Fold changes are of later samples relative to initial samples. Genes highlighted in red have a log2 fold change > 2 and an adjusted p value < 0.1. (B) Gene Ontology analysis reveals that differentially expressed genes are significantly enriched in biological processes related to microtubule-based motility. The twenty biological processes with the lowest adjusted p values are shown. The length of the horizontal bars corresponds to the number of DE genes in each GO category (“Number Enriched). Bar color corresponds to the adjusted p value for enrichment of DE genes in each pathway.

Metagenomic analysis shows high levels of clinically relevant bacteria in three samples

We used a previously described metagenomic pipeline (CLOMP25) to perform taxonomic assignment of preprocessed reads. We excluded samples that had fewer than 10,000 reads after trimming and any samples that underwent enrichment via probe-capture or amplicon-sequencing for SARS-CoV-2. A total of 24 samples from 11 patients were included in further analysis (Supplementary Fig. S3, Supplementary Fig. S4). No viruses aside from SARS-related coronaviruses met the required cutoffs to be classified as co-infections (see “Methods”). Multiple samples had detectable numbers of bacterial reads, in particular Staphylococcus aureus and Moraxella catarrhalis. Samples from two patients (P004, P005) had high detectable levels (> 100 RPM) of both bacteria at one or more time point(s) (Supplementary Fig. S3B). In patient P006, who had paired nasopharyngeal (NP) and oropharyngeal (OP) swabs collected at the same time point, we detected Capnocytophaga gingivalis, Capnocytophaga leadbetteri, and Streptococcus parasanguinis in the OP swab at 21,960, 9475, and 4476 RPM, respectively. All three species of bacteria commonly colonize the oropharynx. In contrast, in the NP swab, the predominant species of bacteria was a common skin colonizer, Cutibacterium acnes, at 824 RPM. In P010, we found a large number of reads corresponding to Corynebacterium spp. at one time point. Upon review of medical records, we found no mention of bacterial co-infections or of positive bacterial cultures in the charts for P004 or P006. P005 had a nares culture that grew methicillin-resistant Staphylococcus aureus three months prior to SARS-CoV-2 infection.

Discussion

In this study, we performed high-throughput sequencing of longitudinal clinical specimens that were positive for SARS-CoV-2 by RT-QPCR. Most samples were sequenced using a metagenomic approach, which enabled us to simultaneously derive information about viral evolution, host transcription, and the presence of other organisms within patient samples.

We showed that although the viral consensus sequence remains largely unchanged over the course of infection, there is a relative abundance of genome-wide low-frequency variants. Similar to other studies, we saw a wide range in the number of variants detected across samples26 and distribution of variants across the genome, though some positions appeared to be more prone to variation 4,26. Studies of SARS-CoV-2 and other respiratory viruses2,3,8,27 have demonstrated the transmission of minor variants and the role of these population bottlenecks on viral evolution, underscoring the importance of studying within-host viral variation. All variants demonstrating significant longitudinal evolution in our sample set collected March-September 2020 have been observed in consensus sequences from around the globe21, albeit at relatively low prevalence.

In one patient, we observed rapid turnover of multiple deletion variants in the N-terminal domain of the spike glycoprotein, which has previously been seen in persistent infection in immunocompromised individuals and has been associated with viral escape of neutralizing antibodies8,10,22,28. Deletions in the NTD are of particular significance due to their presence in currently circulating lineages of concern. Here we show the emergence of a deletion in this genomic region in an immunocompetent background. It is unclear if the absence of this mutation at day 6 is due to successful clearance of the NTD variant or lack of detection of the minor allele associated with lower copy numbers. In addition, while some of these low frequency deletion alleles have been previously shown to arise independently in different patients in response to similar selection pressures8, the presence of multiple low frequency deletion alleles within the same patient may be the product of parallel within-host microevolutionary processes suggestive of some selective advantage, particularly with its similarity to the S:∆142–144 deletion in the currently circulating Omicron variant. Notably, we did not find any evolving variants selected for in the RBD, the main target for neutralizing activity of human plasma 29,30. Additionally, we found two highly dynamic mutations within the Ubl2 and PL2Pro domains of the multifunctional nsp3 protein31 and three mutations within the nsp12 polymerase. While we and others have previously linked a specific nsp12 mutation to remdesivir resistance32, each of these mutations has no phenotypic associations to date and awaits further biochemical and virological characterization.

Individual host factors, such as the immune response and respiratory tract microbiome, may play an important role in viral persistence. In particular, because SARS-CoV-2 infection is slow to resolve, the adaptive immune response could drive within-host viral evolution as variants that can escape T-cell and antibody responses develop. Although we were underpowered to see specific evidence of an adaptive immune response being mounted against SARS-CoV-2 in the nasopharynx, detailed studies evaluating antibody and T-cell receptor repertoire changes throughout the course of infection could shed light on the role of immune pressure in the development of minor variants. Similarly, the relationship between bacterial colonization of the nasopharynx and the development or suppression of inflammation in response to SARS-CoV-2 infection remains poorly understood.

As viral load decreases during recovery, it becomes more challenging to recover viral genomes. As a result, one of the limitations of our study is the variability in sequencing depth across samples and the difficulty in ensuring similar sequencing depth for samples from different time points. We used an amplicon sequencing-based approach described previously33 to obtain near full-length genomes from low viral load samples (up to Ct values of 36). We also used multiple library preparations and performed re-sequencing to ensure the accuracy of variant calls.

Taken together, our results suggest that low frequency genomic variants emerge in immunocompetent individuals, but that these variants are unlikely to reach fixation. Given the emergence of rapidly spreading variants of concern over the past several months, the limited intra-host evolution observed in our dataset highlights the critical impact that a select few individual intra-host evolutionary events may have on the course of the global pandemic and the need for continual genomic surveillance.

Methods

Sample collection and clinical testing for SARS-CoV-2

Specimens were obtained as part of clinical testing for SARS-CoV-2 ordered by local healthcare providers or collected at drive-through testing sites. RNA was extracted and the presence of SARS-CoV-2 was detected by RT-QPCR as previously described using either the emergency use-authorized UW CDC-based laboratory-developed test, Hologic Panther Fusion or Roche cobas SARS-CoV-2 tests34,35. Supplemental Table 1 contains clinically relevant details of these specimens.

Chart review and ethics approval

We received approval from the University of Washington Institutional Review Board (UW IRB) to use residual clinical specimens for sequencing and to review clinical records of patients who received care within the UW network. We also received a waiver of informed consent from the UW IRB for the use of residual clinical specimens and retrospective chart review to perform this work. Information obtained from medical records included sex, age, comorbidities, medication, hospital or critical care admission, and discharge status. All methods were carried out in accordance with relevant guidelines and regulations. 18 of the 20 total patients had one or more known comorbidities (Supplementary Table 1).

Sequencing and bioinformatic analysis

Sequencing was attempted on all samples with a positive RT-QPCR assay result that had a Ct ≤ 36 using one of three methods available at the time: (1) a shotgun metagenomic approach using Illumina Nextera XT described previously36 for samples with a Ct less than 24; (2) an oligo-nucleotide probe capture-based approach from IDT (xGen NGS hybridization capture, Integrated DNA Technologies) similar to previous work37 for samples with Ct between 24 and 28, or (3) using Swift Biosciences’ Normalase Amplicon Panel library preparation workflow38 for samples with Ct between 28 and 36. Libraries were sequenced on Illumina MiSeq, NextSeq, or NovaSeq instruments using 300, 150, 100, or 75 bp reads. Consensus sequences were assembled using TAYLOR38, a custom bioinformatics pipeline (https://github.com/greninger-lab/covid_swift_pipeline) with or without an additional primer clipping step depending on library preparation method.

Consensus sequences from each individual were aligned with the reference sequence NC_045512 using MAFFT v739. Clade assignments were generated using Pangolin (http://github.com/cov-lineages/pangolin) and Nextstrain40 in December 2020. Consensus sequences with < 5% Ns across the length of the genome were considered for further analysis.

Variants were also called with TAYLOR from aligned reads. Variants leading to coding changes with a sequencing depth of > 100 and an allele frequency > 0.01 were subjected to further analysis. For variants that had an allele frequency of < 0.4, we included an additional quality-control step to only include mutations that were present across multiple samples. We also excluded mutations in the first 100 and last 50 bases, as well as variants determined to be due to sequencing error. Most samples were re-prepped and sequenced multiple times to ensure accuracy of variant calls. Variants at positions 6700, 11081-83, 19989, and 29056 were observed in a large number of samples but were determined to be the result of homopolymer sequencing error and were excluded.

RNAseq analysis

For samples that were prepared with no target enrichment step (metagenomic sequencing), reads were adapter and quality trimmed with Trimmomatic v0.3941 using the call “leading 3 trailing 3 slidingwindow:4:15 minlen 20”, then pseudoaligned to the hg38-derived human transcriptome using Kallisto v0.4642. Only samples with more than 900,000 reads pseudo-aligned to the human genome were used for analysis. Differential expression analysis using the Wald test was performed using DEseq243 and deemed significant at a Benjamini–Hochberg adjusted p value < 0.1. Statistical enrichment of Gene Ontology Biological Processes was performed on all significant genes using the R package clusterProfiler44. Raw counts were submitted to the Gene Expression Omnibus, accession number GSE173310.

Metagenomic analysis

Raw FASTQ files were analyzed using CLOMP v0.1.4 (https://github.com/FredHutch/CLOMP) as previously described25. Samples with more than 10 million reads were randomly down-sampled to 10 million reads before analysis using the “sample” command in seqtk (https://github.com/lh3/seqtk). The pipeline output was visualized using the Pavian metagenomic explorer45, and reads per million (RPM) calculations were done using a custom R script. Results were filtered to highlight RPM counts for a shortlist of clinically relevant taxa (Supplementary Table S4). Samples were determined to be positive if the species level RPM was at least 30 for viruses, and 100 for bacteria.