Abstract
The outbreak of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused an unprecedented pandemic. Since the first whole genome of SARS-CoV-2 was sequenced in January 2020, identifying its genetic variants has become crucial for tracking and evaluating their spread across the globe.
In this study, we compared 134,905 SARS-CoV-2 genomes, isolated from all affected countries since the outbreak of this novel coronavirus, with the first genome sequenced in Wuhan, China, to quantify the evolutionary divergence of SARS-CoV-2. We also compared the codon usage patterns of the SARS-CoV-2 genes encoding the membrane protein (M), envelope protein (E), spike surface glycoprotein (S), nucleoprotein (N), and RNA-dependent RNA polymerase (RdRp). The polyproteins ORF1a and ORF1b were examined separately.
We found that SARS-CoV-2 tends to diverge over time by accumulating mutations in its genome and, specifically, in the sequences encoding proteins N and S. Interestingly, different patterns of codon usage were observed among these genes. Genes S and N tend to use a narrower set of synonymous codons that are better optimized to the human host. Conversely, genes E and M consistently use a broader set of synonymous codons, which does not vary with respect to the reference genome. The time evolution of the codon adaptation index (CAI) and of SiD shows a decreasing tendency for most genes. Forsdyke plots, used to study the nature of the mutations, show a rapid evolutionary divergence of each gene, as indicated by the low values of the x-intercepts.
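To make the notion of codon optimization to a host concrete, the following is a minimal sketch of the codon adaptation index (CAI) in its standard form: the geometric mean, over the codons of a gene, of each codon's relative adaptiveness (its reference frequency divided by that of the most frequent synonymous codon). It is not the authors' pipeline. The codon families, the reference counts, and the function names `relative_adaptiveness` and `cai` are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical subset of synonymous codon families (not the full genetic code).
SYNONYMOUS = {
    "K": ["AAA", "AAG"],
    "E": ["GAA", "GAG"],
    "F": ["TTT", "TTC"],
}

# Hypothetical codon counts in a reference set of host genes (toy numbers).
REFERENCE_COUNTS = {
    "AAA": 20, "AAG": 40,
    "GAA": 30, "GAG": 30,
    "TTT": 10, "TTC": 40,
}


def relative_adaptiveness(counts, families):
    """Weight of each codon: its count divided by the highest count
    among the codons that encode the same amino acid."""
    weights = {}
    for codons in families.values():
        top = max(counts[c] for c in codons)
        for c in codons:
            weights[c] = counts[c] / top
    return weights


def cai(sequence, weights):
    """Geometric mean of the weights of the codons in `sequence`.

    Codons absent from `weights` are skipped. This sketch assumes every
    weight is strictly positive, since log(0) is undefined.
    """
    usable = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    codons = [sequence[i:i + 3] for i in range(0, usable, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


WEIGHTS = relative_adaptiveness(REFERENCE_COUNTS, SYNONYMOUS)
preferred = cai("AAGGAGTTC", WEIGHTS)   # only the most frequent codons
avoided = cai("AAAGAATTT", WEIGHTS)     # same amino acids, rarer codons
```

Under these toy counts, both sequences encode the same three amino acids, yet the one built from the host's preferred codons scores higher. A CAI that decreases over time, as reported for most genes, would correspond to a drift toward the lower-scoring kind of sequence.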
Competing Interest Statement
The authors have declared no competing interest.