Background
Since its emergence in late 2019, the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has spread globally and caused the coronavirus disease 2019 (COVID-19) pandemic, with very serious medical sequelae for many individuals [1, 2]. The rapid development and introduction of vaccines against SARS-CoV-2 within a year, by late 2020, contributed to significantly lowering the transmission, severity of illness, and mortality from COVID-19 [3]. Despite this unprecedented effort, SARS-CoV-2 continued to mutate and evolve, resulting in the emergence of new strains and multiple waves of new infections across the globe [4]. Specifically, the World Health Organization (WHO) designated 5 strains (Alpha, Beta, Gamma, Delta, and Omicron) as variants of concern (VOC) due to an increase in transmission or virulence, or to escape from existing diagnostics, vaccine immunity, and therapeutics [5-7]. The continued emergence of SARS-CoV-2 strains with higher fitness suggests the need for continued monitoring of viral evolution and identification of features contributing to viral fitness.
Evaluation of nucleotide composition and synonymous codon usage bias of viral genomes is useful to understand viral genetic changes, immune evasion and its adaptation to host [
8]. Degeneracy of the genetic code, in which 61 sense codons (of 64 possible codons) encode 20 amino acids, allows a majority of amino acids to be encoded by more than one codon. The synonymous codons of an amino acid are not used uniformly: some codons are preferred over others, a phenomenon known as codon usage bias (CUB) [
9]. CUB is primarily shaped by mutational and translational pressures which correlate with multiple factors including translation efficiency/fidelity (abundance of tRNAs), gene expression, location within genes, nucleotide composition, and mRNA/protein structure [
10,
11]. Relative synonymous codon usage (RSCU) [
12] and codon adaptation index (CAI) [
13] are a couple of metrics commonly used to measure CUB [
14,
15]. The scaled ratio of nonsynonymous to synonymous variants, referred to as dN/dS, is often calculated to understand the direction and magnitude of selection at the molecular level [16]. However, interpretation of results from such an analysis assumes that synonymous variants are neutral. A growing body of literature over the last 2 decades has demonstrated that synonymous variants can affect the expression and quality attributes of the encoded protein by multiple mechanisms at both transcriptional and translational levels, through their effects on pre-mRNA splicing, mRNA structure/stability, miRNA binding, translation efficiency, and co-translational folding [17]. Additionally, RNA structures play an important role in the translation and replication of RNA viruses, including SARS-CoV-2, and selection pressure against synonymous variants in these functional regions has been reported [18]. Therefore, synonymous variants cannot be assumed to be neutral, and careful interpretation of results is needed [19].
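The RSCU metric described above can be illustrated with a short calculation. The following is a minimal sketch in Python, assuming the standard genetic code and the usual definition of RSCU (observed codon count divided by the count expected if all synonymous codons of that amino acid were used equally). The function name `rscu` and the toy sequence are illustrative assumptions and are not taken from the cited studies.

```python
from collections import Counter, defaultdict
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def rscu(cds):
    """Return {codon: RSCU} for one in-frame coding sequence (hypothetical helper)."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))

    # Group sense codons into synonymous families, one family per amino acid.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in codons:
            # Observed count divided by the count expected under uniform usage.
            values[c] = counts[c] * len(codons) / total
    return values


# Toy example with four proline codons: CCT twice, CCA once, CCC once, CCG never.
toy = rscu("CCTCCTCCACCC")
print(toy["CCT"], toy["CCA"], toy["CCG"])  # 2.0 1.0 0.0
```

An RSCU above 1 marks a codon used more often than expected under uniform usage, and a value below 1 marks an underused codon.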
SARS-CoV-2 has a positive sense single strand RNA genome of ~ 30 kb in length [
20,
21]. About 2/3 of the genome at the 5′ end comprises two overlapping open reading frames, ORF1a and ORF1ab. The distal 1/3 of the genome encodes 4 structural proteins, Spike (S), Envelope (E), Membrane (M), and Nucleocapsid (N), as well as additional accessory proteins (ORF-3a, -3b, -6, -7a, -7b, -8, and -10). The Spike protein is a surface glycoprotein that is essential for viral attachment and entry into host cells. The M and E proteins are also located on the surface of the virion, with functional roles in viral assembly and pathogenesis. The N protein is located within the viral capsid and is involved in viral RNA packaging. Similar to other coronaviruses, SARS-CoV-2 has an AT-rich genome (~ 62% AT and ~ 38% GC content). T is the most used nucleotide (~ 34%), followed by A (~ 28%), C (~ 20%), and G (~ 18%). Consistent with this composition, RSCU analysis showed that a majority of preferentially used codons in the SARS-CoV-2 genome are T- or A-ending, while G- and C-ending codons predominate among the less preferred ones [
22]. Over time, a significant increase in T usage with a simultaneous decrease in C usage was reported, primarily through C→T transitions mediated by host cell APOBEC cytosine deaminases [23]. A minimal, albeit significant, decrease in the CAI values of SARS-CoV-2 genomes over time was also reported. This decrease was attributed to the lower CAI values of later emergent strains like Alpha and Delta rather than to a decrease within strains. Interestingly, the latest VOC, Omicron, was reported to have a CAI higher than other VOC but lower than the original Wuhan-Hu-1 strain [
24].
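Composition figures such as the AT and GC percentages quoted above come from simple base counting. The sketch below shows one way to do this in Python. The helper names `nucleotide_frequencies` and `gc_content` and the toy sequence are assumptions for illustration only, and ambiguous bases are simply ignored, which may differ from how the cited studies handled them.

```python
from collections import Counter


def nucleotide_frequencies(seq):
    """Fraction of A, C, G and T among unambiguous bases (N and gaps are ignored)."""
    seq = seq.upper().replace("U", "T")
    counts = Counter(base for base in seq if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return {base: counts[base] / total for base in "ACGT"}


def gc_content(seq):
    """Combined G and C fraction of the sequence."""
    freqs = nucleotide_frequencies(seq)
    return freqs["G"] + freqs["C"]


# Toy sequence (not a real viral fragment): 5 T, 3 A, 1 G, 1 C and two ambiguous bases.
freqs = nucleotide_frequencies("ATTTGCAATTNN")
print(freqs["T"], freqs["A"])
```

Applied per gene and per sampling month, the same counting gives the kind of nucleotide frequency series followed over time in this study.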
Previous studies analyzed a limited number of sequences and strains over a short period of time. In this study, we sought to perform a variety of bioinformatic analyses on the largest possible cohort of SARS-CoV-2 sequences for the entire timespan of the COVID-19 pandemic. We analyzed 3,573,491 sequences downloaded from GISAID, a substantially larger cohort than comparable analyses [
24‐
27], by measuring nucleotide and codon frequencies, RSCU, and CAI over time separated by month and clade. Additionally, we analyzed dN/dS ratios for individual clades of SARS-CoV-2 and visualized the frequencies of synonymous and nonsynonymous mutations for Spike gene of SARS-CoV-2 Omicron and Delta clades, and analogous genes in Middle East Respiratory Syndrome Coronavirus (MERS-CoV), Influenza, and Dengue viruses. Altogether, these analyses showed changes in nucleotide and codon usage characteristics of SARS-CoV-2 genome over time and revealed its unique mutation pattern compared to some other RNA viruses.
Discussion
In this study, we analyzed the nucleotide and codon frequency data for 3,573,491 SARS-CoV-2 sequences, covering 12 genes (ORF1ab, Spike, ORF3a, ORF3b, E, M, ORF6, ORF7a, ORF7b, ORF8, N, and ORF10) and five GISAID specified clades (Alpha, Beta/Mu, Gamma/Lambda, Delta, and Omicron) over 32 months (December 2019–July 2022) [
28]. When analyzing each gene individually, we found very little overall nucleotide and codon variation between and within clades over time, regardless of the length of the gene. We subsequently showed that within this small pool of mutations, nonsynonymous Spike gene mutations overwhelmingly dominate the phenotypic landscape, with relatively strong selection pushing for changes away from human codon usage trends/bias in most cases. Overall, our findings showed that as SARS-CoV-2 spread around the world, small shifts in nucleotide and codon frequency were enough to produce a variety of clades expressing varying degrees of infectivity and virulence [
44].
Several previous studies also looked at nucleotide and codon usage of SARS-CoV-2, but used relatively small sample sizes, covered shorter date ranges, excluded clades, and/or analyzed concatenated genes or full sequences [
24‐
27,
45‐
49]. By analyzing the genes individually and expanding the number of sequences and dates, we found that most SARS-CoV-2 genes over time favored the T nucleotide followed by A, C and G. Roy et al. [
22] found similar results with 99 concatenated GISAID sequences downloaded in February 2020. Although slight variations in nucleotide frequency have been reported [23, 45, 47, 50], the A/T bias remains the same, except within the N gene (A > C > G > T) (Fig.
2). Low GC content is found in all coronaviruses and has been shown to strengthen translation initiation through biochemical rigidity [
50,
51]. The G and/or C nucleotide content is on average suppressed in most clades for the genes analyzed in this study. At the same time, the T nucleotide content on average increased in ORF1ab, ORF3a, ORF3b, E, ORF7b, ORF8, and N genes (Additional file
5: Table S5). ORF6 has the biggest disparity between A/T and G/C frequencies, though this difference is decreasing over time for most clades. The N gene is unique in its nucleotide content; however, there is a gradual increase in the T nucleotide and decrease in the C nucleotide for all clades except Alpha. Several publications have shown evidence for C→T hypermutation and its possible relation to RNA editing processes via APOBEC proteins [
25,
48,
49]. This may be one explanation for the overall maintenance of low GC content of SARS-CoV-2 sequences, despite the large difference from human usage [
48].
As Fig.
2 illustrates, we saw very little nucleotide variation within and between clades for each gene. Nonetheless, many of these differences between clades were significant and aided in their divergence (see Methods; Additional file
2: Table S2 and Additional file
3: Table S3). We can zoom in on these distributions and note some interesting observations: (1) average nucleotide frequency fluctuations over time are much smaller than changes in variance (for most genes, variance fluctuates similarly for all clades but Omicron), and (2) the average and variance of these distributions over time are shown to positively covary for Spike, ORF3a, ORF3b, E, ORF6, ORF7a, N, and ORF10 genes, and negatively covary for ORF1ab, M, ORF7b, and ORF8 genes. Many biological analyses focus on following the average over time (even when full distribution data is displayed) [
24,
47], but other moments (i.e., properties or characteristics) of the distribution can influence the evolutionary trajectory of a given clade. Following the variance in nucleotide frequency, and not just the average, is important because large distribution variances tend to reduce the strength of selection and therefore reduce the likelihood of a particular nucleotide increasing in frequency over time (everything else held equal) [52]. For most SARS-CoV-2 genes, changes in variance over time for all nucleotides are quite small, though variance increases in most cases and peaks in different months for different genes. The increasingly weak magnitude of selection, independent of direction, has thus been largely maintained across these 32 months for SARS-CoV-2. Nonetheless, we documented many significant differences between the clades and their nucleotide distributions over time. As noted in the Results, some months showed more differences than others, but we did not identify a cause for this.
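To make the idea of following both the average and the variance concrete, the sketch below computes the two moments per month for one nucleotide in one gene and then reports whether they covary positively or negatively across months. It is a minimal illustration in Python. The function names, the use of population variance, and the toy frequencies are all assumptions and do not reproduce the calculations described in the Methods.

```python
from statistics import mean, pvariance


def monthly_moments(freqs_by_month):
    """Map each month to (mean, variance) of the per-sequence frequencies.

    freqs_by_month: {"YYYY-MM": [frequency in sequence 1, frequency in sequence 2, ...]}
    Population variance is used here as an assumption.
    """
    return {month: (mean(values), pvariance(values))
            for month, values in sorted(freqs_by_month.items())}


def mean_variance_covariance(moments):
    """Covariance between the monthly means and the monthly variances."""
    means = [m for m, _ in moments.values()]
    variances = [v for _, v in moments.values()]
    mu_m, mu_v = mean(means), mean(variances)
    return sum((a - mu_m) * (b - mu_v) for a, b in zip(means, variances)) / len(means)


# Toy data: T frequency of two sequences per month (made-up numbers).
toy = {
    "2020-01": [0.30, 0.32],
    "2020-02": [0.30, 0.36],
    "2020-03": [0.30, 0.40],
}
moments = monthly_moments(toy)
cov = mean_variance_covariance(moments)
print("positive" if cov > 0 else "negative or zero")  # positive
```

A positive value corresponds to the pattern reported here for genes such as Spike, where the average and the variance rise and fall together, while a negative value corresponds to the opposite pattern.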
Our codon analysis was performed to assess the codon bias within each of the five clades, across 12 genes, and over 32 months. We found that the nucleotide preference within codons does not strictly adhere to the nucleotide bias discussed above for whole genes. For example, the top three overrepresented codons for ORF1ab (GCT, AGA, and GGT) and Spike (AGA, GGT, and TCT) genes use the G nucleotide most often, though the T nucleotide is preferred in the third position of the codon (Fig.
4B, D). The unique nucleotide bias of N gene (A > C > G > T) (Fig.
2G) resulted in a bias towards the A nucleotide within its codons. The T nucleotide is the least used nucleotide across the N gene; nonetheless, it is more likely to reside in the third position among the top overrepresented codons (GCT, AGA, and ACT) (Fig.
4F). Similar RSCU results were found when concatenating genes—codon AGA is found to possess an RSCU value > 2, regardless of date or location of samples [
34,
47,
50].
When we look closely at the codon with the highest RSCU value for ORF1ab, Spike and N genes (GGT, AGA, and AGA respectively), there is an overall decrease in the average frequency across most clades over time (Beta/Mu increases for ORF1ab and Spike genes, Alpha increases for Spike gene) (Additional file
9: Table S9). We also found an overall decrease in the frequency variance for ORF1ab (GGT) and N (AGA) genes, and an increase for the Spike (AGA) gene over time (Beta/Mu increases for ORF1ab gene) (Additional file
10: Table S10). This means that codon usage across individual SARS-CoV-2 sequences is becoming more alike over time, with fewer GGT codons in the ORF1ab gene and fewer AGA codons in the N gene. Meanwhile, diversity in the use of the Spike gene AGA codon is increasing. This is interesting because the three codons with the highest RSCU values in humans (CTG, GCC, and CAG) are GC heavy, especially in the first and third codon positions. Overall, SARS-CoV-2 codon usage remains quite different from human usage [
53]. This antagonism may aid the SARS-CoV-2 population by promoting better folded proteins [
34] and avoiding competition [
25].
We narrowed our focus to a specific amino acid group to identify explicit differences between synonymous codons within each clade and across genes ORF1ab, Spike, and N. Proline codon (CCA, CCC, CCG, and CCT) usage was of particular interest to follow because of its slow rate of translation and requirement of a specific elongation factor (eIF5A) for translation of polyproline sequences [
54]. ORF1ab, Spike, and N genes showed slightly different proline codon preferences that were maintained over the 32 months for each clade (Fig.
4A, C, E). All three genes for all clades preferred codons CCA and CCT, but the rarely used proline codons are what set these three genes and clades apart. CCG and CCC are regularly used in the ORF1ab gene at similar frequencies, CCC in the Spike gene, and both CCG and CCC in the N gene but at differing frequencies. This codon usage is consistent with the results of the top three RSCU valued codons in ORF1ab, Spike and N genes, where A and T nucleotides are more likely to be present in the third position. It has been shown in bacteria that the synonymous proline codon used matters in the context of translation efficiency [
55]. Human proline RSCU (CCC > CCT > CCA > CCG) differs considerably from SARS-CoV-2 usage, which may further promote the slowing of translation [
53].
We also looked more closely at codons for each gene that resulted in the largest total number of significant differences between clades summed over all 32 months (Additional file
7: Table S7 and Additional file
22: Figure S10). Within each month, codon distributions for each clade are compared pairwise, unless sample size is insufficient for the calculation. With very little variation within each clade’s codon distributions over time (as seen in Fig.
3 and Additional file
23: Figure S11 and Additional file
24: Figure S12), we found that this is enough variation to drive significant differences between clades. There were a few interesting observations of these codons (ORF1ab—CTT, Spike—GAT, ORF3a—CAG, ORF3b—TTA, E—CTG, M—TAC, ORF6—GAC, ORF7a—GTT, ORF7b—TTG, ORF8—CAA, N—CAG, and ORF10—CTC): (1) they consist of mostly T nucleotides and few G nucleotides, (2) they are more likely to possess a C nucleotide in the first position, equally an A or T nucleotide in the second position, and a G nucleotide in the third position, and (3) CAG (glutamine) is the only codon with the most significant differences found in more than one gene (ORF3a and N). We also analyzed the average and variance over time of these codon frequency distributions and found that most clades of a given gene fluctuate together. We show that within a gene, these codons are more likely to decrease in average and variance over time (Additional file
9: Table S9 and Additional file
10: Table S10). As noted for nucleotide frequency, following the variance in these codon frequency distributions matters. Reducing the variance in codon frequency increases the magnitude of selection [52]. Therefore, the direction in which selection pushes the SARS-CoV-2 population may not change much, but the step size with which it does so could aid the divergence of the SARS-CoV-2 clades.
Our CAI analysis shows that SARS-CoV-2 codon usage similarity to the host does change over time, though only slightly in most cases, and demonstrates that trends are not identical for different genes (Fig.
5 and Additional file
25: Figure S13). Spike gene, which is considered an important and consequential antigenic site for the virus [
56], had fairly consistent CAI values for most clades but a dramatic drop in CAI for Omicron, indicating that more recent (and contagious) strains of the virus are moving away from human codon usage. No other gene mimics this trend for Omicron (the closest being the ORF6 gene, though this still shows an increase in CAI in later months), which could point towards a unique codon usage characteristic of the Spike gene. Posani et al. calculated and compared CAI for several genes, including the Spike gene, from December 2019 to February 2021, and their primary conclusion was that CAI tends to decrease over time for these genes [8]. Mogro et al. calculated changes in CAI from December 2019 to September 2021, finding a negative trend in CAI when all genes are concatenated. However, this trend became less apparent when sequences were separated by gene or clade [
24]. Their dataset for the dN/dS analysis is considerably more constrained (a total of about 200,000 sequences). Our analysis still concurs with their observation that CAI does not have a consistent downwards trend over time when sequences are separated by gene, and there are notable differences between different clades, genes, and timepoints in CAI [
24].
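For readers unfamiliar with how CAI expresses similarity to host codon usage, the following sketch may help. It assumes the common formulation in which each codon receives a relative adaptiveness weight (its count in a host reference set divided by the count of the most used synonymous codon) and the CAI of a gene is the geometric mean of those weights. The function names, the pseudocount of 0.5 for unseen codons, and the reference counts are illustrative assumptions; the counts are made-up numbers that only follow the human proline ordering mentioned earlier (CCC > CCT > CCA > CCG).

```python
import math
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def relative_adaptiveness(ref_counts, pseudocount=0.5):
    """Weight per codon: reference count / count of the most used synonymous codon.

    Met, Trp and stop codons are excluded. The pseudocount for codons that are
    absent from the reference set is an assumption of this sketch.
    """
    weights = {}
    for aa in set(CODON_TABLE.values()) - {"*", "M", "W"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        observed = [ref_counts.get(c, 0) for c in family]
        top = max(observed)
        if top == 0:
            continue  # no reference data for this amino acid
        for codon, n in zip(family, observed):
            weights[codon] = max(n, pseudocount) / top
    return weights


def cai(cds, weights):
    """Geometric mean of the weights of all scorable codons in the sequence."""
    cds = cds.upper().replace("U", "T")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Made-up reference counts for proline only.
toy_reference = {"CCC": 40, "CCT": 30, "CCA": 20, "CCG": 10}
w = relative_adaptiveness(toy_reference)
print(cai("CCCCCC", w))  # 1.0, only the most used reference codon appears
print(cai("CCACCT", w) < cai("CCCCCC", w))  # True
```

Under this formulation, a sequence that shifts towards codons rarely used by the host obtains a lower CAI, which is the sense in which a drop in CAI is read as movement away from human codon usage.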
Our dN/dS analysis also finds the Spike gene to be different from the other genes: for Alpha, Delta, and Omicron, the ratio is above one for all timepoints, and for Gamma/Lambda, it is above one for most timepoints after 2021. This shows that nonsynonymous changes are favored within the Spike gene for these clades and timepoints. No other gene has a clade with dN/dS above one, with the exception of the ORF8 gene in Alpha, which is above one for a few timepoints in early 2021. This is yet another way in which the mutation profile of the Spike gene differs from that of other SARS-CoV-2 genes, which could be connected to its role as the primary cell entry mediator for the virus. Nikolaidis et al. performed an analysis calculating dN/dS over time for about 850,000 sequences (mixed NCBI and GISAID data) up to December 2021 for the Spike, ORF1a, and ORF1b genes; they note that dN/dS had started to increase notably for the Spike gene in the second year of the pandemic, though it still remained relatively close to one. The average dN/dS ratio for the Spike gene was observed to be above one for all clades [57]. This is fairly consistent with our results, though their dN/dS numbers are lower than ours for the Spike gene at similar timepoints. Of note, our data include Omicron within the first half of 2022, during which dN/dS increased far more than at any previous timepoint. We also show data for genes not featured in the aforementioned study; all of the dN/dS values for these genes are far lower than that of the Spike gene.
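A ratio above one means that nonsynonymous differences are more frequent per nonsynonymous site than synonymous differences are per synonymous site. The sketch below illustrates that normalization with a simplified Nei-Gojobori-style count between a reference and a variant coding sequence. It is not the pipeline used in this study: the function names are hypothetical, codons differing at more than one position are not counted as differences, changes to stop codons are treated as nonsynonymous when counting sites, and no multiple-hit correction is applied.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon):
    """Sum over positions of the fraction of single-base changes that keep the amino acid."""
    aa = CODON_TABLE[codon]
    s = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODON_TABLE[mutant] == aa:
                s += 1 / 3
    return s


def count_differences_and_sites(ref, var):
    """Return (syn diffs, nonsyn diffs, syn sites S, nonsyn sites N) for aligned CDSs."""
    sd = nd = S = N = 0.0
    usable = min(len(ref), len(var)) // 3 * 3
    for i in range(0, usable, 3):
        a, b = ref[i:i + 3].upper(), var[i:i + 3].upper()
        if a not in CODON_TABLE or b not in CODON_TABLE:
            continue  # skip codons with gaps or ambiguous bases
        if "*" in (CODON_TABLE[a], CODON_TABLE[b]):
            continue  # skip stop codons
        s = (synonymous_sites(a) + synonymous_sites(b)) / 2
        S += s
        N += 3 - s
        if sum(x != y for x, y in zip(a, b)) == 1:  # single-base differences only
            if CODON_TABLE[a] == CODON_TABLE[b]:
                sd += 1
            else:
                nd += 1
    return sd, nd, S, N


# Toy example: CCT>CCC is synonymous (Pro), GAT>GGT is nonsynonymous (Asp to Gly).
sd, nd, S, N = count_differences_and_sites("CCTGATAAA", "CCCGGTAAA")
ratio = (nd / N) / (sd / S)  # uncorrected pN/pS; undefined when sd is zero
print(sd, nd, round(S, 3), round(N, 3), round(ratio, 3))
```

In the toy example there is one difference of each kind, yet the ratio is below one because nonsynonymous sites are far more numerous than synonymous sites. This is why raw counts of mutations and dN/dS can give different impressions.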
Visualization of the synonymous and nonsynonymous mutations of SARS-CoV-2 using heatmaps supported our dN/dS calculations for Spike gene, showing that nonsynonymous mutations far outnumber synonymous ones for both Omicron and Delta. This contrasts with other RNA viruses, which show comparatively fewer nonsynonymous mutations. In a clinical study investigating intra-host single-nucleotide variants (iSNVs), nonsynonymous substitutions were noted to be overrepresented, indicating that there is a biological phenomenon at play in determining the synonymous and nonsynonymous mutations observed for this virus [
58‐
60]. One factor that could be relevant to this observed difference between SARS-CoV-2 and other RNA viruses is the fidelity of the viral polymerases and the presence or absence of associated proofreading activities. Literature sources indicate that the Dengue virus RNA-dependent RNA polymerase (RdRp) has an error rate of approximately 1e−4 mutations per base pair [58-60], Influenza A RNA polymerase has an error rate of approximately 1.5e−5 mutations per base pair [61], and SARS-CoV-2 RdRp has an error rate between 1e−6 and 5e−6 mutations per base pair [62] (estimates of the MERS-CoV mutation rate exist, but they do not use methodologies comparable to the ones cited here for the other viruses). The substantially lower error rate of SARS-CoV-2 relative to the other RNA viruses points to a different proofreading mechanism, which could be linked to the preponderance of nonsynonymous mutations; it is possible that this mechanism disfavors synonymous changes. Specifically, SARS-CoV-2 (along with other coronaviruses) encodes nsp14, a 3′–5′ exoribonuclease that ameliorates the poor fidelity of the RdRp [63] and is encoded within the ORF1ab region of the genome [64]. Influenza A RNA polymerase and Dengue virus RdRp are both considered low-fidelity replicases [60, 65]. The additional proofreading mechanism in SARS-CoV-2 could be tied to the difference observed in synonymous and nonsynonymous mutations when compared to other RNA viruses, though this would require further investigation to confirm.
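The size of the gap between these error rates is easier to appreciate with a short calculation. The sketch below takes the rates quoted above at face value and computes fold differences relative to SARS-CoV-2, together with the expected number of errors per copy of a genome of about 30 kb. The dictionary layout and function names are illustrative assumptions, and the per-copy figure is a simple product of rate and length, not a measured quantity.

```python
GENOME_LENGTH = 30_000  # approximate SARS-CoV-2 genome length in bases (~30 kb)

# (low, high) error rates in mutations per base pair, as quoted in the text above.
ERROR_RATES = {
    "Dengue RdRp": (1e-4, 1e-4),
    "Influenza A polymerase": (1.5e-5, 1.5e-5),
    "SARS-CoV-2 RdRp": (1e-6, 5e-6),
}


def fold_difference(other, reference="SARS-CoV-2 RdRp"):
    """Smallest and largest fold difference of `other` relative to `reference`."""
    o_low, o_high = ERROR_RATES[other]
    r_low, r_high = ERROR_RATES[reference]
    return o_low / r_high, o_high / r_low


def expected_errors_per_copy(name, length=GENOME_LENGTH):
    """Expected errors when copying `length` bases once (rate times length)."""
    low, high = ERROR_RATES[name]
    return low * length, high * length


print(fold_difference("Dengue RdRp"))             # roughly 20 to 100 fold
print(fold_difference("Influenza A polymerase"))  # roughly 3 to 15 fold
print(expected_errors_per_copy("SARS-CoV-2 RdRp"))  # roughly 0.03 to 0.15 errors
```

On these figures, fewer than one error would be expected per copy of the SARS-CoV-2 genome, which is consistent with the very limited nucleotide variation described in this study.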
We further show the codon and nucleotide composition of SARS-CoV-2 Spike Omicron and Delta in Additional file
29: Figure S17, and tabulate the portion of synonymous, nonsynonymous, and WT codons in Table
6. Together, these data support the notion that nonsynonymous changes far outnumber synonymous ones for SARS-CoV-2, but this is highly position-dependent, and at certain locations along the Spike sequence, synonymous changes are overwhelmingly dominant. Altogether, this could point towards an underlying biological mechanism in SARS-CoV-2 that favors nonsynonymous changes in most locations, or it could be that SARS-CoV-2 is a relatively “recent” virus (at least, compared to the other viruses featured in our study), and might show a different synonymous/nonsynonymous mutation profile later in its evolution.