Background

The earliest confirmed case of severe acute respiratory syndrome (SARS) occurred in November 2002 in Guangdong province, China. Toward the end of the epidemic (as reported by July 31, 2003), there were 8,098 recognized cases in 31 countries or regions worldwide and 774 implicated deaths (WHO, http://www.who.int/csr/sars/country/table2003_09_23/en/). Thanks to an unprecedented international effort, the SARS coronavirus (SARS-CoV) was identified as the causal agent in late March 2003, and its first complete genomic sequences were published on April 13, 2003 [1, 2]. One month later, SARS-like coronaviruses were found in palm civets and other animals in Guangdong, China, the first evidence of possible interspecies transmission of the virus [3]. The re-emergence of isolated SARS cases in Asia in December 2003, and in Anhui province and Beijing, China, in late April 2004, has supported a widespread conjecture that SARS-CoV will likely be with humans for years to come. This re-emergence makes it legitimate to critically re-evaluate the time of origin of SARS-CoV.

There are 26 putative coding regions that cover about 98% of the 29.8-kb SARS-CoV genome. Approximately two-thirds of the genome, at the 5' side, encodes the nonstructural proteins (orf1ab and orf1a), and one-third, at the 3' side, encodes four structural proteins: spike glycoprotein (S), envelope (E), membrane (M), and nucleocapsid (N) [4]. The spike glycoprotein, especially its S1 subdomain, is responsible for binding to the specific receptor on the target cells [4, 5]. The RNA polymerase and nsp1 genes are two major loci in orf1ab.

Estimating the mutation rate in RNA viruses and retroviruses is critical, but also challenging, for tracing their rapidly evolving paths. The rates estimated from positive-strand ssRNA viruses appear to be in a similar range (e.g., ~10⁻³ per site per year) to those from negative-strand ssRNA viruses, although a direct comparison is not possible because the mutation rates could be estimated from different regions or genes [6–15]. The estimated mutation rates in coronaviruses, to which SARS-CoV is phylogenetically linked, are moderate to high compared to those of other ssRNA viruses. For example, the rate was estimated to be 0.3 – 0.6 × 10⁻² per site per year in the infectious bronchitis virus in a previous study [8]. However, the estimated mutation rate appears to have a wider range in retroviruses [16–20]. More details are presented in the Discussion section.

How SARS-CoV evolves has important implications for both strategic planning in the prevention of SARS epidemics and the development of vaccines and antibodies. The mutation rate is among the most fundamental aspects of sequence evolution. If the pathogen evolves slowly, there will be a better chance of developing effective, long-lasting vaccines, and treatment that succeeds for patients from a particular geographic region will likely be effective for patients from other areas. On the other hand, if the pathogen (particularly the genes coding for major antigens) evolves rapidly, an effective strategy to prevent transmission of SARS-CoV must be the top priority, and an effective vaccine program may be problematic. The purpose of this study is to improve our understanding of the evolutionary mechanism in the SARS-CoV genome, and in particular to address the mutation rate and the time of emergence of SARS-CoV in the human population. We report the mutation rate in SARS-CoV estimated from the available complete genomic sequences whose clinical history either is certain or could be inferred.

Results

Mutation rate

The sources of the genomic sequences used in this study and the methods of estimating mutation rates are presented in the Methods section. The divergence time was inferred from the information summarized in Figure 1. Table 1 shows the mutation rates estimated by three strategies. When the first strategy was used to adjust for sequencing errors and potential mutations in cell culture, the mutation rate was estimated to be 0.80 – 2.38 × 10⁻³ nucleotide substitutions per site per year using all the sequences not generated from mainland China, and 0.81 – 1.38 × 10⁻³ nucleotide substitutions per site per year using the TOR2 and Urbani sequences only. When the second strategy was used, the mutation rate was estimated to be 0.74 – 1.62 × 10⁻³ nucleotide substitutions per site per year, which is lower than that from the first strategy. As expected, the mutation rate estimated using the third strategy was the lowest: 0.54 – 1.57 × 10⁻³ nucleotide substitutions per site per year using the 11 sequences not generated from mainland China, and 0.42 – 0.72 × 10⁻³ nucleotide substitutions per site per year using the TOR2 and Urbani sequences only.

Figure 1

Clinical relations and estimated range of the divergence time among 16 SARS-CoV isolates. This figure is adapted from Figure 5 in [4]. Solid arrows indicate certain SARS coronavirus transmission routes and dashed lines indicate uncertain routes. SINxxxx denotes an unavailable primary contact of the Singaporean index patient (SIN2500). The numbers indicate the range of the divergence time (in days) between two isolates.

Table 1 Mutation rate (per site per year).

Substitution rate in the coding regions

For all samples, the proportion of non-synonymous substitutions per non-synonymous site (Ka) was 0.63 × 10⁻³ and the proportion of synonymous substitutions per synonymous site (Ks) was 0.65 × 10⁻³, giving a Ka/Ks ratio of 0.97. This ratio was 0.79 in the nonstructural region and 1.37 in the structural region. In particular, the values of Ka/Ks were 1.98 for nsp1 and 0.85 for S.

Table 2 shows the rates of nucleotide substitution in the coding regions of the sequences. The overall rates of non-synonymous and synonymous substitutions were 1.16 – 3.30 × 10⁻³ and 1.67 – 4.67 × 10⁻³ per site per year, respectively. The non-synonymous rate was higher in the three genes E, M, and N, suggesting that some of those mutations might increase antigenicity, although the number of mutations used to calculate these rates was small.

Table 2 Substitution rates (× 10⁻³ per site per year) and Ka/Ks ratio in the coding regions.

Time for the origin of SARS-CoV

The mutation rate estimated earlier allowed us to estimate the age of the most recent common ancestor (MRCA) of the sample, which should be about the same as, or more recent than, the time of origin of SARS-CoV. The phylogeny reconstructed by the neighbor-joining method with mid-point rooting or by maximum parsimony is overall consistent with the epidemic (Additional file 1). All the sequences from mainland China clustered together and separated from the remaining sequences, including those clinically related to the index patient A. GZ01 was distantly separated from the other sequences. Assuming the MRCA is the root of the phylogeny, the age of the MRCA is then the divergence time between GZ01 and the other sequences. Using the mutation rates estimated above, we found that the MRCA could have existed between March 28 and November 29, 2002 (strategy 1), between February 22 and October 3, 2002 (strategy 2), or even earlier (strategy 3). The most critical implication of these analyses is that it is entirely plausible that the MRCA of the sample existed as early as the spring of 2002.

Discussion

Uncertainty in the quality of the sequence data and incomplete information from patient histories are two limiting factors of this study. The worldwide race to understand this novel virus has provided an unprecedented set of complete genome sequences of a pathogen within an interval of a few weeks, but likely side effects of this race are an elevated error rate in the released sequences and errors generated during the analysis. Among the 129 sequence variations reported [4], many were artifacts generated by the algorithms during the alignment of the multiple sequences and should therefore be removed or adjusted. This concern led us to wait until all the sequences used in this study had been significantly revised by their generators and to adjust the multiple-sequence alignment manually. Still, some errors were unavoidable, partly because of the intrinsic error rate of sequencing technology. For example, among 18 common variations, 9 could not be uniquely assigned to the internal branches of the phylogeny. This incongruence is likely due in part to sequence errors. The existence of sequence errors can also be inferred by examining the ratio of transitional to transversional changes. If nucleotide substitution occurs randomly, there are on average two transversional substitutions for each transitional substitution, and the ratio of transitions to transversions should be 0.5. However, transitions are generally favored over transversions in many organisms. For example, the ratio is approximately 2 in the human genome [21, 22]. The ratio has not been discussed extensively for RNA viruses; however, it appears to be higher than that in mammalian genomes, based on two previous reports of 3.7 in the influenza A virus [23] and 5.0 in the Marburg virus [24]. In this study, 60 transitional substitutions and 54 transversional substitutions were observed among the 16 sequences; thus the ratio was 1.1. The ratio in the five sequences from mainland China was 0.9, considerably smaller than the 2.2 observed in the other eleven sequences. This suggests that the sequences from mainland China may contain more errors than the other sequences. In addition, the ratio was 0.9 for the singleton variations, much lower than the ratio of 3.5 for the non-singleton variants. This further indicates that singletons were more problematic.
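The transition/transversion reasoning above can be illustrated with a short Python sketch. The function names are hypothetical and this is not the code used in the study; it only classifies base changes and reproduces the random expectation of 0.5 and the observed ratio of 60/54.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a: str, b: str) -> str:
    """Return 'transition' or 'transversion' for a change between two DNA bases."""
    a, b = a.upper(), b.upper()
    if a == b:
        raise ValueError("identical bases are not a substitution")
    if not {a, b} <= (PURINES | PYRIMIDINES):
        raise ValueError("only A, C, G, T are supported")
    # A change within the same chemical class (purine or pyrimidine) is a transition.
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return "transition" if same_class else "transversion"

def ts_tv_ratio(substitutions) -> float:
    """Ratio of transitions to transversions over an iterable of (from, to) pairs."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in substitutions:
        counts[classify_substitution(a, b)] += 1
    return counts["transition"] / counts["transversion"]

# Random expectation: of the 12 possible ordered changes, 4 are transitions
# and 8 are transversions, so the ratio is 0.5.
bases = "ACGT"
all_changes = [(a, b) for a in bases for b in bases if a != b]
random_ratio = ts_tv_ratio(all_changes)

# Observed counts reported in the text: 60 transitions and 54 transversions.
observed_ratio = 60 / 54
print(f"random: {random_ratio:.2f}, observed: {observed_ratio:.2f}")
```

An observed ratio close to the random expectation, rather than the higher values reported for other RNA viruses, is what motivates the concern about sequencing errors.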

Because of the unknown level of errors in the sequences, a conservative approach to estimating the mutation rate was taken. Three strategies were used to reduce the effect of sequence errors, one being more aggressive than the other two. The mutation rates estimated by the first two strategies were quite similar. In the third strategy, all the variants unique to a given isolate were excluded. Such a strategy is very conservative because the number of singletons is expected to be large in a rapidly expanding population (see below). Therefore the mutation rate was placed in the range of 0.80 – 2.38 × 10⁻³ nucleotide substitutions per site per year based on the 11 sequences used. This rate, along with the rate of synonymous substitutions estimated in this study, is close to that recently reported using another approach [25]. In comparison to other coronaviruses, this rate is lower than that in the mouse hepatitis virus, similar to that in the transmissible gastroenteritis virus, but higher than that in the infectious bronchitis virus (Table 3) [6–8]. The estimated mutation rate is of the same order of magnitude as in other RNA viruses, for example, 2.3 × 10⁻³ nucleotide substitutions per site per year in the influenza A viruses [12, 13]. The estimated mutation rate in HIV appears to have a wide range [16, 17]. It is likely that the mutation rate in SARS-CoV is not higher than that in HIV. Therefore, SARS-CoV is not an unusual coronavirus or RNA virus in terms of its speed of nucleotide changes. One of the challenging tasks, therefore, is to find the variations that make SARS-CoV distinct from other RNA viruses, especially coronaviruses, and to determine how those variations changed its functionality and helped it transmit to humans.

Table 3 Mutation rate in viruses.

Nucleotide variation is distributed along the entire genome. Based on our alignment and the annotation in GenBank, 21 of the 26 open reading frames contained variations, including the genes encoding the polymerase, spike glycoprotein, envelope, membrane, and nucleocapsid proteins. The estimated mutation rate suggests that approximately 2 to 6 new mutations will occur each month in a viral genome, assuming a uniform mutation rate overall. However, the rate of non-synonymous substitutions might vary during the course of SARS-CoV evolution [25]. An excess of mutations (and amino acid changes) was observed in the external branches of the phylogeny of a large sample of HA gene sequences of influenza A, which was partially caused by sampling bias [26]. From a population genetics standpoint, a large proportion of mutations should occur in the external branches when the number of infected hosts has increased rapidly. Therefore, one should not conclude that the mutation rate is low because of a relatively small number of mutations in the internal branches [27]. Our analysis, even with a conservative estimate of the mutation rate, indicates that the SARS-CoV population already harbors a considerable amount of genetic diversity.
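The figure of roughly 2 to 6 new mutations per month follows from the estimated rate range and the genome length. A minimal sketch of that arithmetic, assuming a genome of 29,800 sites (the 29.8-kb figure given in the Background) and a uniform rate across sites:

```python
GENOME_LENGTH = 29_800                     # sites, from the 29.8-kb genome size
RATE_LOW, RATE_HIGH = 0.80e-3, 2.38e-3     # substitutions per site per year

def mutations_per_month(rate_per_site_per_year: float,
                        genome_length: int = GENOME_LENGTH) -> float:
    """Expected new mutations per genome per month under a uniform rate."""
    return rate_per_site_per_year * genome_length / 12

low = mutations_per_month(RATE_LOW)
high = mutations_per_month(RATE_HIGH)
print(f"{low:.1f} to {high:.1f} mutations per genome per month")
# prints "2.0 to 5.9 mutations per genome per month"
```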

The time of emergence of SARS-CoV is of special importance in dissecting the origin of the virus as well as the dynamics of the epidemic. The time of the most recent common ancestor of the 16 isolates was estimated to be between February 2002 and November 2002. Although this is consistent with the date of the earliest known case of SARS and with the dates estimated in other studies [25, 28], it also suggests that SARS-CoV could have been present earlier than the generally believed emergence around November 2002. One possible scenario is that SARS-CoV had already infected some people in the spring of 2002 but failed to cause an epidemic; its spread was then suppressed in the summer (similar to the summer of 2003), and it re-emerged around November to cause the epidemic in 2003. Given the current re-emergence of SARS cases, this scenario is becoming more likely. There were indeed some media reports of patients with SARS-like symptoms in the spring of 2002, although none has been convincingly confirmed. An alternative scenario is that the common ancestor of SARS-CoV lived in the spring of 2002 but in an animal host. The recent finding of high sequence homology between the isolate from a newly emerged SARS case (December 16, 2003) and the isolates from masked palm civets [29] makes civets the primary suspected reservoir for SARS-CoV.

Conclusions

The estimated mutation rate and the synonymous and non-synonymous substitution rates in the SARS-CoV genome were moderate compared to those in other coronaviruses and RNA viruses, suggesting that SARS-CoV is not an unusual coronavirus in terms of its speed of nucleotide or amino acid changes. Based on the mutation rates estimated in this study, the time of the most recent common ancestor of the 16 isolates can be placed between February 2002 and November 2002. This suggests that SARS-CoV could have been with humans as early as the spring of 2002 without causing a severe epidemic.

Methods

Sequence data

We obtained 16 complete genomic sequences from the NCBI website http://www.ncbi.nlm.nih.gov/. Among them, five sequences (BJ01-04 and GZ01) were obtained from the hosts collected in mainland China and the remaining sequences (TOR2, Urbani, CUHK-W1, CUHK-Su10, HKU-39849, five Singaporean sequences, and TW1) were from the hosts in other geographic regions. Detailed information of the sequences is shown in Table 4.

Table 4 Sources of 16 genomic sequences.

Sequence analysis

CLUSTAL X [30], a windows-based user interface to CLUSTAL W, was used to align the multiple sequences. The alignment was then examined and adjusted manually. All gene annotation information and nucleotide position designations in this study refer to the TOR2 sequence (GenBank accession ID: NC_004718). To avoid complications, only single nucleotide variations were analyzed and all alignment gaps were excluded. This led to the identification of a total of 114 single nucleotide variations among all the sequences and an average of 18.2 nucleotide differences between two sequences.

The MEGA2 computer program [31] was used to calculate the pair-wise nucleotide differences. The resulting genetic distances were corrected by Jukes and Cantor's method [32]. The phylogeny of the sample was reconstructed using both neighbor-joining and maximum parsimony methods [31, 33].
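For readers unfamiliar with the Jukes and Cantor correction, the following Python sketch shows the standard formula d = -(3/4) ln(1 - 4p/3) applied to the proportion p of differing sites. It is an illustration only, not the MEGA2 implementation; the helper names are hypothetical, and skipping gapped positions pair by pair is a simplification of excluding all gapped columns from the alignment.

```python
import math

def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences, ignoring gaps."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # skip positions with an alignment gap in either sequence
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared

def jukes_cantor(p: float) -> float:
    """Jukes-Cantor corrected distance for an observed proportion of differences p."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

# With an average of 18.2 differences over a 29.8-kb genome, the correction is tiny.
p = 18.2 / 29_800
print(f"p = {p:.6e}, corrected d = {jukes_cantor(p):.6e}")
```

At divergences this small the corrected and uncorrected distances are nearly identical, because multiple substitutions at the same site are very unlikely.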

In principle, the mutation rate can be estimated as the number of nucleotide differences between two sequences divided by twice their divergence time, i.e., the time to their most recent common ancestor. Because their contact histories were better documented, mutation rates were estimated only from the sequences whose hosts were not from mainland China, that is, sequences TOR2, Urbani, CUHK-W1, CUHK-Su10, HKU-39849, five Singaporean sequences, and TW1. First, the range of the divergence time between each pair of sequences was inferred from information on infection history, reported strain isolation dates, and sequence release dates (Additional file 2) [4, 34–36]. For example, the divergence time between isolates TOR2 and Urbani was estimated to be in the range of 34 to 58 days [35, 36]. Second, the nucleotide difference between each pair of sequences was calculated with adjustments to reduce the effect of sequencing errors and potential mutations during cell culture. Three strategies were used. The first strategy reduced the number of pair-wise nucleotide differences by the average number of nucleotide differences observed among five closely related Singaporean sequences [4]. This strategy effectively assumes that there is no real nucleotide difference among these five sequences, so that their observed differences reflect the level of errors. The second strategy reduced the pair-wise nucleotide difference by two and added 7 days to the divergence time to account for cell culture time. This strategy assumes that the mutation rate during cell culture is the same as that in the human host and that on average the sequencing error is one nucleotide per genome. In the third strategy, we excluded all the nucleotide variants that had been observed only once (singletons) among the 61 human SARS-CoV sequences reported in [25]. The rationale is that non-singleton mutations observed in a sample are much less likely to be due to sequencing errors or to mutations during laboratory passage of the virus. This strategy is conservative and can be regarded as giving the lower bound of the mutation rate. Finally, the mutation rate per site per year was estimated by

$$\mu = \frac{365}{n(n-1)/2} \sum_{i<j} \frac{d_{ij}}{t_{ij}}$$

where $d_{ij}$ is the genetic distance between sequences $i$ and $j$, $t_{ij}$ is twice their divergence time (in number of days), and $n$ is the number of sequences.
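As an illustration of how such an estimate could be computed, the Python sketch below averages d_ij / t_ij over all pairs of sequences and multiplies by 365 to convert from days to years. The averaging over pairs is an assumption inferred from the definitions of d_ij, t_ij, and n given here, and the function names and the example distance are hypothetical; only the 34 to 58 day range and the second-strategy adjustment come from the text.

```python
from itertools import combinations

def mean_pairwise_rate(distance, divergence_days, names):
    """Average d_ij / t_ij over all pairs, scaled from per day to per year.

    distance[(i, j)]        genetic distance per site between isolates i and j
    divergence_days[(i, j)] divergence time in days (t_ij is twice this value)
    Keys are pairs of names in sorted order.
    """
    pairs = list(combinations(sorted(names), 2))
    if not pairs:
        raise ValueError("at least two sequences are required")
    total = 0.0
    for pair in pairs:
        t_ij = 2 * divergence_days[pair]  # both lineages accumulate changes
        total += distance[pair] / t_ij
    return 365 * total / len(pairs)

def strategy2_adjust(n_differences, divergence_days):
    """Second strategy from the text: subtract two differences, add 7 days."""
    return max(n_differences - 2, 0), divergence_days + 7

# Illustrative pair: the day range is from the text, the distance is invented.
names = ["TOR2", "Urbani"]
pair = ("TOR2", "Urbani")
d = {pair: 2.5e-4}  # hypothetical distance per site
rate_if_58_days = mean_pairwise_rate(d, {pair: 58}, names)
rate_if_34_days = mean_pairwise_rate(d, {pair: 34}, names)
print(f"{rate_if_58_days:.2e} to {rate_if_34_days:.2e} per site per year")
```

Because the divergence time is only known as a range, evaluating the estimator at both ends of the range yields a range for the rate, which is how the intervals in Table 1 should be read.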

A mutation in a codon is non-synonymous (or non-silent) if it changes the amino acid, and is synonymous (silent) otherwise. The number of non-synonymous mutations per non-synonymous site (Ka) and the number of synonymous mutations per synonymous site (Ks) were computed using the method of Li, Wu, and Luo [37]. The non-synonymous and synonymous substitution rates were calculated using the divergence time as estimated above. Only the second strategy was applied to the rate estimation because the number of nucleotide differences used for the adjustment in the first strategy cannot be separated into non-synonymous and synonymous mutations.
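The definition of synonymous and non-synonymous changes can be made concrete with a small Python sketch based on the standard genetic code. It only classifies a single codon change; it does not implement the method of Li, Wu, and Luo, which additionally counts synonymous and non-synonymous sites and corrects for multiple substitutions. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (first base varies slowest, third base fastest). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def classify_codon_change(codon_before: str, codon_after: str) -> str:
    """Return 'synonymous' if the encoded amino acid is unchanged, else 'non-synonymous'."""
    # Accept RNA codons by mapping U to T.
    before = codon_before.upper().replace("U", "T")
    after = codon_after.upper().replace("U", "T")
    if before == after:
        raise ValueError("codons are identical")
    same = CODON_TABLE[before] == CODON_TABLE[after]
    return "synonymous" if same else "non-synonymous"

print(classify_codon_change("CTT", "CTC"))  # Leu to Leu
print(classify_codon_change("GAT", "GAA"))  # Asp to Glu
```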