Background
Coronaviruses (CoVs) are positive-sense, single-stranded RNA viruses of the order
Nidovirales, family
Coronaviridae, subfamily
Orthocoronavirinae, with four genera, namely alpha [α], beta [β], gamma [γ] and delta [δ], which have been further subdivided into 25 subgenera, including five for β-CoVs:
Sarbecovirus,
Merbecovirus,
Embecovirus,
Nobecovirus and
Hibecovirus [
1], and fifteen for α-CoVs:
Luchacovirus,
Decacovirus,
Nyctacovirus,
Minunacovirus,
Pedacovirus,
Colacovirus,
Myotacovirus,
Duvinacovirus,
Setracovirus,
Rhinacovirus,
Tegacovirus,
Minacovirus,
Sunacovirus,
Soracovirus, and
Amalacovirus [
2]. Seven CoVs infect humans; two of the α-genus (the
Duvinacovirus hCoVs 229E and the
Setracovirus NL63) and five of the β-genus: the
Sarbecoviruses severe acute respiratory syndrome (SARS)-CoVs 1 and 2, the latter responsible for a pandemic since 2019 [
3‐
6]; the
Merbecovirus Middle East respiratory syndrome (MERS) CoV; and the
Embecoviruses hCoV-OC43 and -HKU1. Human CoVs have a zoonotic origin, with bats as a key reservoir [
7] and possibly other hosts [
8,
9]. Bat β-CoVs related to human CoVs belong to the
Sarbecovirus,
Nobecovirus, and
Hibecovirus subgenera [
10‐
12].
Coronaviruses display substantial genomic plasticity and resilience [
13,
14] via recombination, point mutations, deletions, and insertions, which are reported to drive variant emergence, host range, gene expression, transmissibility, immune escape, and virulence [
15‐
20]. The use of an RNA-dependent-RNA polymerase (RdRp)-driven template switching mechanism for transcription and control of structural and accessory gene expression in CoVs [
20] has been reported to account for the high frequency of recombination [
13,
18,
21‐
27].
In template switching, a leader transcription regulatory sequence (TRS-L; ACGAAC core in β-CoVs) [
28] in the 5′-untranslated region (UTR) interacts with homologous TRS-body (B) elements upstream of viral genes in the last third of the genome (illustrated for SARS-CoV-2 in Additional file
1: Fig. S1A) [
29,
30]. Template switching renders the neighborhood of TRS-Bs, especially that for the spike gene, a recombination hotspot during viral transcription [
3,
16,
21,
22,
24,
27,
31‐
34].
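As a minimal sketch of how candidate TRS sites could be located, the snippet below scans a nucleotide string for the ACGAAC core named above. The function name `find_motif` and the toy sequence are hypothetical and serve only as illustration; this is not the search procedure used in the study, and exact matching ignores the extended homology that flanks the core.

```python
def find_motif(genome: str, motif: str = "ACGAAC") -> list[int]:
    """Return 0-based start positions of every (possibly overlapping) motif hit."""
    # Normalize case and accept RNA notation (U) as well as DNA notation (T).
    genome = genome.upper().replace("U", "T")
    motif = motif.upper().replace("U", "T")
    hits = []
    start = genome.find(motif)
    while start != -1:
        hits.append(start)
        # Resume one base further on so overlapping hits are not missed.
        start = genome.find(motif, start + 1)
    return hits


# Toy sequence for illustration only; it is not a real coronavirus genome.
toy = "TTACGAACTTTGGGACGAACAA"
print(find_motif(toy))  # → [2, 14]
```

On a real genome, a hit in the 5′-UTR would correspond to the TRS-L, and hits upstream of genes in the last third of the genome would be candidate TRS-B elements.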
Viral subgenomic messenger RNAs contain a 5′-leader sequence that spans from the terminal 5′-cap (m7G) structure to the TRS-L and harbors three conserved stem-loop (SL1-3) regulatory elements of gene expression and replication (Additional file
1: Fig. S1B) [
35‐
37]. The TRS-L core sequence and the secondary structure of the leader sequence are conserved within but not among coronavirus genera (Rfam database:
http://rfam.xfam.org/covid-19).
The entire 5′-leader nucleotide sequence of SARS-CoV-2, and the sequence beyond it up to almost SL5, can be translated into a peptide sequence (Additional file
1: Fig. S1B), and although there is no evidence for the functionality of any open reading frame within the UTRs [
36,
38], the 5′-leader sequence could be translated after most of it (nucleotides 8–80, including SL1-3 and TRS-L) is duplicated and translocated to the distal end of the accessory ORF6 gene of a SARS-CoV-2 variant with deleted ORFs 7a, 7b and 8 isolated from 3 patients in Hong Kong [
39]. We also found that a shorter portion of the 5′-leader sequence (nucleotides 50–75) is duplicated and translocated to the end of the accessory ORF8 gene of a USA variant (accession number: QUP34336) and could be translated into a modified ORF8 protein, which prompted us to conduct a systematic analysis.
In the present study, using 5′-leader nucleotide sequences and amino acid sequences translated in the three reading frames as queries to search public databases, we document the presence of intragenomic rearrangements involving segments of the 5′-leader sequence in geographically and temporally diverse isolates of SARS-CoV-2. The intragenomic rearrangements could modify the carboxyl-termini of the ORF8 (also in Rhinolophus bat Sarbecovirus β-CoVs) and ORF7b proteins; the serine-arginine-rich region of the nucleocapsid protein, generating the well-characterized R203K/G204R paired mutation; and two sites of the NiRAN domain of the RdRp (nsp12).
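The peptide queries mentioned above come from translating a nucleotide segment in each of the three forward reading frames. The following is a minimal sketch of that step, assuming the standard genetic code; the names `CODON_TABLE`, `translate`, and `three_frames` are hypothetical, and the input is a toy sequence rather than the actual SARS-CoV-2 leader.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq: str, frame: int) -> str:
    """Translate seq starting at offset frame (0, 1, or 2); unknown codons become 'X'."""
    seq = seq.upper().replace("U", "T")
    # Stop at len(seq) - 2 so that only complete codons are read.
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(seq: str) -> list[str]:
    """Return the translations of the three forward reading frames."""
    return [translate(seq, frame) for frame in range(3)]


# Toy input for illustration only.
print(three_frames("ATGGCCTAA"))  # → ['MA*', 'WP', 'GL']
```

Each resulting peptide string could then be submitted as a protein query, while the untranslated segment serves as the nucleotide query.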
Beyond SARS-CoV-2, we found similar rearrangements of 5′-UTR leader sequence segments including the TRS-L in all subgenera of β-CoVs except for
Hibecovirus (possibly secondary to the availability of only 3 sequences in GenBank). These rearrangements are in the intergenic region between ORFs 3 and 4a, and at the distal end of ORF4b of the
Merbecovirus MERS-CoV; intergenic regions in the
Embecoviruses hCoV-OC43 (between S and Ns5) and hCoV-HKU-1 (between S and NS4); and in the distal end that encodes the Y1 cytoplasmic tail domain of nsp3 of
Nobecoviruses of African
Rousettus and
Eidolon bats. We also found intragenomic rearrangements in α-CoVs in nsp2 (
Luchacovirus subgenus), nucleocapsid (
Nyctacovirus subgenus), and ORF5b or ORF4b (
Decacovirus subgenus). No rearrangements involving 5′-UTR sequences were detected for the β-CoV SARS-CoV-1; the other 12 subgenera of α-CoVs including hCoV-229E and hCoV-NL63 infecting humans; or δ (
Andecovirus,
Buldecovirus, and
Herdecovirus subgenera) and γ CoVs (
Brangacovirus,
Cegacovirus, and
Igacovirus subgenera) for which wild birds are the main reservoir [
12,
40].
The present study highlights an intragenomic source of variation involving duplication, inversion (in two α-CoV subgenera) and translocation of 5′-UTR sequences to the body of the genome with potential implications for gene expression and immune escape of α- and β-CoVs in humans and bats causing mild-to-moderate or severe disease in endemic, epidemic, and pandemic settings. Genome-wide annotations had revealed 1516 nucleotide-level variations at various positions throughout the entire SARS-CoV-2 genome [
41], and a recent study of all complete SARS-CoV-2 proteomes documented widespread variations of each of the six accessory proteins across six continents, which were suggested to reflect effects on SARS-CoV-2 pathogenicity [
42]. However, the function and even expression of some of these accessory proteins remain a matter of debate due to inconsistencies derived from the use of bioinformatics predictions and from studies in different cell types rather than in in vivo infection settings. The intragenomic rearrangements involving 5′-UTR sequences described here, which in several cases affect highly conserved genes with a low propensity for recombination, may underlie the generation of variants homotypic with those of concern or interest and with potentially differing pathogenic profiles.
Discussion
We here describe intragenomic rearrangements involving 5′-UTR sequences and the coding section of the genome of beta- and alphacoronaviruses. Additional file
1: Fig. S4A summarizes the locations of insertions in accessory, structural, and nonstructural genes of SARS-CoV-2, which for at least the accessory and structural genes appear to involve and/or affect the template switching mechanism by creating new regions of homology for interaction with TRS-L. The presence of conserved complementary sequences (CCSs) in the 5′- and 3′-UTRs potentially involved in circularization of the genome during subgenomic RNA synthesis has been reported [
74]. As shown in Additional file
1: Fig. S4B, the 5′-UTR sequences involved in intragenomic rearrangements in SARS-CoV-2 shown in the present work usually include the TRS-L and span approximately half of the 5′ CCS, thus potentially facilitating circularization of the genome from locations closer to the 3′-UTR. The 5′-UTR sequences involved in intragenomic rearrangements may also facilitate other long-distance RNA-RNA interactions contributing to the complex coronavirus transcription process [
75].
Most of the 5′-UTR sequences duplicated and translocated include TRS-L. Extending the homology region of interaction between the TRS-L in the 5′-leader and the TRS-L introduced in a particular area of the body of the genome optimizes the minimum free energy of the interaction. Such facilitation may favor expression of certain genes over that of others, thereby altering the hierarchy in gene expression. Because insertions are in various locations of viral genes, including some encoding nonstructural proteins, they may promote formation of new subgenomic RNAs, thereby expanding the repertoire of proteins and even transforming noncanonical subgenomic messenger RNAs, i.e., those not associated with TRS homology, into canonical ones. SARS-CoV-2 and other CoVs have been reported to generate noncanonical subgenomic RNAs in abundance, accounting for up to a third of subgenomic messenger RNAs in cell culture models of infection and increasing in proportion over time [
76].
The structural genes control genome dissemination [
63] while the accessory genes in the same region of the genome may be involved in adaptation to specific hosts, modulation of the interferon signaling pathways, the production of pro-inflammatory cytokines, or the induction of apoptosis [
77], among other mechanisms underlying immune evasion and pathogenesis. Gaining insight into the effect of the amino acid changes introduced by the 5′-UTR sequences is likely to shed light on pathogenesis and immune evasion mechanisms. For instance, a few point mutations can have a profound effect, as exemplified by the few mutations in the C-terminus of the spike protein that transform the feline CoV associated with mild disease into the feline infectious peritonitis virus, which is generally lethal [
78].
ORF8 had been postulated to originate from
ORF7a by non-homologous recombination, and a predicted structure model of the ORF8 protein of SARS-CoV-2 revealed a ~60-residue core like that of SARS-CoV-2 ORF7a protein [
79] with the addition of two dimerization interfaces, one covalent and the other noncovalent, unique to SARS-CoV-2 ORF8 [
80]. In the C-terminus of ORF8 that would be predicted to be altered by 5′-UTR sequence insertions (i.e.,
115RVVLDFI
121), R115, D119, F120, and I121 contribute to the covalent dimer interface (marked with asterisks in Fig.
1) with R115 and D119 forming salt bridges that flank a central hydrophobic core in which V117 interacts with its symmetry-related counterpart [
80].
How the C-terminal insertions and changes therein affect the dimerization of ORF8 protein remains to be determined, and the described functions for ORF8 protein remain a matter of debate [
81]. However, the predicted changes caused by insertions might contribute to immune evasion by SARS-CoV-2 by affecting the interactions of the ORF8 glycoprotein homodimer with intracellular transport signaling, leading to down-regulation of MHC-I by selective targeting for lysosomal degradation via autophagy [
82], and/or extracellular signaling involving interferon-I signaling [
83], mitogen-activated protein kinase growth pathways [
84], the transforming growth factor-β1 signaling cascade [
85] and interleukin-17 signaling promoting inflammation and contributing to the COVID-19-associated cytokine storm [
86].
The carboxyl-terminal region of the ORF8 protein may include T- and/or B-cell epitopes that may be affected by the variations described. In this regard, approximately 5% of CD4+ T cells in most COVID-19 cases are specific for ORF8 protein, and ORF8 protein accounts for 10% of CD8+ T cell reactivity in COVID-19 recovered subjects [
87,
88]. Another possible effect of the insertions stems from the fact that anti-ORF8 protein antibodies are detected in both symptomatic and asymptomatic patients early during infection by SARS-CoV-2 [
89] and diagnostic assays for SARS-CoV-2 infection that target only accessory genes or proteins such as ORF8 may be affected [
39].
In terms of the potential consequences of intragenomic rearrangements involving
ORF7b of SARS-CoV-2, the function of the SARS-CoV-2 ORF7b protein remains to be determined, although it has been suggested to mediate tumor necrosis factor-α-induced apoptosis based on cell culture data [
90] and, theoretically, the dysfunction of olfactory receptors by triggering autoimmunity [
91].
We also found intragenomic rearrangements in the nucleocapsid gene of SARS-CoV-2 and of bat α-CoVs of the subgenus
Nyctacovirus. The nucleocapsid is the most abundant protein in CoVs, interacts with membrane protein [
92,
93], self-associates to provide for efficient viral assembly [
94], binds viral RNA [
95] and has been implicated in circularization of the murine hepatitis virus genome via interaction with 3′- and 5′-UTR sequences, which may facilitate template switching during subgenomic RNA synthesis [
96]. Phosphorylation transforms N-viral RNA condensates into liquid-like droplets, which may provide a cytoplasmic-like compartment to support the protein’s function in viral genome replication [
93,
97].
The phosphorylation-rich stretch encompassing amino acid residues 180–210 (SR region), encoded by the nucleotide segment where 5′-UTR sequences were detected in SARS-CoV-2, serves as a key regulatory hub in N protein function within a central disordered linker for dimerization and oligomerization of the N protein, which is phosphorylated early in infection at multiple sites by cytoplasmic kinases [
97]. Serine 202 (numbering of reference Wuhan strain), which is phosphorylated by GSK-3, is conserved in the predicted translated 5′-UTR sequence next to the R203K/G204R co-mutation, as is threonine 205, which is phosphorylated by PKA [
98,
99]. R203 and G204 mutations affect the phosphorylation of serines 202 and 206 in turn affecting binding to protein 14-3-3 and replication, transcription, and packaging of the SARS-CoV-2 genome [
100‐
102].
The
N gene displays rapid and high expression, high sequence conservation, and a low propensity for recombination [
34,
103,
104]. However, it can show variation driven by internal rearrangement which does not affect the length of the protein. The N protein is highly immunogenic, and its amino acid sequence is largely conserved, with the serine-arginine (SR) region being a strong immunodominant B-cell epitope [
105] as highlighted in Fig.
3A.
The functional significance of the intragenomic rearrangement in
N of bat α-CoVs of the subgenus
Nyctacovirus remains to be determined. Although in infectious bronchitis virus, the amino terminal domain of N protein has been shown to interact with nucleotide sequences in the 3′-UTR which is relevant to viral RNA packaging, the amino acids that are critical for such interaction are more distally located in the amino terminus (amino acids 76 or 94) [
106,
107] than those encoded by the intragenomic rearrangement in this case.
The intragenomic rearrangements found in MERS-CoV may modulate immune evasion by bringing regulatory sequences to the intergenic regions preceding the
4a and
5 genes and modulating their expression. p4a, a double stranded RNA-binding protein, as well as p4b and p5 of MERS-CoV are type-I IFN antagonists [
108‐
111]. p4a prevents dsRNA formed during viral replication from binding to the cellular dsRNA-binding protein PACT and activating the cellular dsRNA sensors RIG-I and MDA5 [
110,
111]. p4a is the strongest in counteracting the antiviral effects of IFN via inhibition of both its production and Interferon-Stimulated Response Element (ISRE) promoter element signaling pathways [
112]. The latter findings were obtained in cell cultures, and studies of in vivo infection are warranted. In this regard, a more recent study associated p4b with inflammatory pathology and suppression of autophagy in murine lungs, thereby highlighting the complex interplay of proteins during virus replication under in vivo physiological conditions [
113].
Like SARS-CoVs and MERS-CoV, hCoV-OC43 can downregulate the transcription of genes critical for the activation of different antiviral signaling pathways [
114], and the intragenomic rearrangements described in the intergenic region preceding hCoV-OC43
ns5a may modulate immune evasion. In this regard, hCoV-OC43 ns5a, as well as ns2a, M, or N proteins, significantly reduced the transcriptional activity of ISRE, IFN-β promoter, and NF-κB-RE following challenge of human embryonic kidney 293 (HEK-293) cells with Sendai virus, IFN-α or tumor necrosis factor-α [
115].
In hCoV-HKU-1 and hCoV-OC43, intragenomic rearrangements involved the intergenic region at the end of the
S gene highlighting a potential source of regulatory sequences that may affect expression of adjoining genes. The Spike (
S) gene encodes a structural protein that binds to the host receptors and determines cell tropism as well as the host range. The neighborhood of the spike gene, particularly the region before the S gene, is a hotspot for modular intertypic homologous and non-homologous recombination in coronavirus genomes [
34].
Although the nsp3 protein sequence is well conserved among bat
Nobecoviruses, the significance of the nsp3 segment encoded by the 5′-UTR sequence, which might affect double-membrane vesicle formation, remains to be determined. Nsp3 protein, the largest protein encoded by CoVs, encompasses up to 16 modular domains. The N-terminal cytosolic domains include a mono-ADP-ribosylhydrolase, a papain-like protease [
116], and a scaffold region that participates in replication-transcription complex assembly [
117]. After the latter domains, there are two transmembrane domains (TM1 and TM2) with an endoplasmic reticulum luminal loop (Ecto3) between them, and two cytosolic domains (Y1 and CoV-Y) following TM2. The predicted nsp3 segment encoded by the 5′-UTR sequence falls in the cytosolic domain Y1. Nsp3C anchors nsp3 to the endoplasmic reticulum membrane and induces membrane rearrangement leading to double membrane vesicle formation via a yet unknown molecular mechanism [
118,
119]. Although there are structural data on the CoV-Y domain [
120], its function is unknown as is that of the Y1 domain.
The discontinuous RNA synthesis of the polymerase machinery of coronaviruses along with the use of canonical and noncanonical TRS-L and TRS-B pairing may enhance the occurrence of insertions (via intragenomic rearrangements or other means) and deletions, which can remain uncorrected by the proofreading activity of nsp14 exoribonuclease [
121]. Most insertions and deletions likely negatively affect viral fitness [
122]; duplication of TRS sequences in coronaviruses has led to attenuation [
123] and, when affecting essential genes, frequently to viral genetic instability [
124]. However, a small number of insertions/deletions emerge and spread in viral populations, suggesting a positive effect on fitness and adaptive evolution [
125‐
131]. Thus, analyzing these insertions/deletions may reveal evolutionary trends and provide new insight into the surprising variability and rapidly spreading capability that SARS-CoV-2 has shown since its emergence. A usual target of deletions is the set of accessory ORFs in the distal third of the genome, because they do not appear to participate in viral replication but can allow the virus to evade host defenses. Variants with these deletions occur naturally in SARS-CoV-2 and spread without apparently affecting virus infectivity.
Some of the intragenomic rearrangements described here in
ORF8 and
ORF7a and one previously in
ORF6 occurred in viruses with deletions that removed or truncated ORFs, such as the deletion in the B.1.36.27 lineage from Hong Kong which lacks
ORFs 7a,
7b, and
8 and has the last 12 nucleotides of the
ORF6 replaced by ~60 nucleotides from the 5′-UTR [
39]. An 872-nucleotide deletion described in the AY.4 lineage (Delta variant) from Southern Poland also eliminated
ORFs 7a,
7b and
8 [
132], as did an 872-nucleotide deletion documented in late 2021 in Uruguay in a different Delta lineage (AY.20), with viruses without the deletion coexisting with wild-type AY.20 and AY.43 strains [
128,
129].
Two large and phylogenetically unrelated deletions (392 and 227 nucleotides long) fused
ORF7a with downstream ORFs [
133]. One, a 392-nucleotide deletion, lacked
ORF7b and created a new ORF including
ORF7a and
ORF8, while the other, a 227-nucleotide deletion, resulted in a new ORF by combining the proximal end of
ORF7a with
ORF7b. These deletions have become extinct or appear as sporadic or unique variants [
39,
133]. On the other hand, a 382-nucleotide deletion that removes most of the ORF8 was a circulating form hypothesized to lead to an attenuated phenotype of SARS-CoV-2 [
130,
131].
Intragenomic rearrangements in isolates with large deletions, as exemplified by those involving
ORF6 [
39],
ORF7b and
ORF8 of SARS-CoV-2, in all cases thus far affect the carboxyl-termini of the predicted encoded proteins. The length of the insertions does not notably affect that of the predicted proteins in isolates without major genomic deletions. For 5′-UTR segments within viral genes, such as the examples shown in
N,
nsp12 and
nsp3, or intergenic regions, the length of the protein or intergenic region appears not to be affected.
Intragenomic rearrangements are yet another example of the tremendous genomic flexibility of coronaviruses which underlies changes in transmissibility, immune escape and/or virulence documented during the SARS-CoV-2 pandemic.
Limitations
The intragenomic rearrangements involving 5′-UTR sequences were detected in all subgenera of β-coronaviruses infecting humans (i.e., Sarbecovirus, Embecovirus, and Merbecovirus) and in the Nobecovirus but not the Hibecovirus subgenera of CoVs infecting bats. There were only 3 Hibecovirus genomes in the database, which may account for the lack of detection of internal rearrangements in this subgenus most closely related to Sarbecoviruses. In this respect, the greater diversity of rearrangements detected in SARS-CoV-2 may reflect the bias generated by the presence in GenBank of SARS-CoV-2 isolates in numbers up to 5 orders of magnitude greater than those of any other CoV. However, the relative paucity of α-, γ-, or δ-CoV sequences available also applies to those of β-CoVs other than SARS-CoV-2 for which 5′-UTR rearrangements were found in notable proportions. Moreover, the present analysis included CoVs involved in large outbreaks such as the swine enteric CoVs of the α and δ genera and avian infectious bronchitis virus of the γ genus that have been studied over decades with hundreds of isolates characterized without apparent evidence for intragenomic rearrangements. The apparent absence of internal rearrangements in the latter viruses bodes well for the specificity of the findings described here for 4 of 5 subgenera of β-CoVs and 3 of 15 subgenera of α-CoVs.
Many sequences in the databases have incomplete 5′-UTRs rendering it difficult to comprehensively analyze them and to calculate more reliable proportions of variations. There are also partial genome and protein sequences, and we excluded sequences with undetermined amino acids. Nonetheless, for SARS-CoV-2, the frequency of variants with full-length insertions appears low relative to those with subsegments or other mutations in comparison to the reference strain in the same insertion area. One could posit that for hCoV-OC43 and hCoV-HKU-1, the apparently much higher frequency of intragenomic rearrangements involving 5′-UTR sequences might be driven by characterization of a greater number of isolates during epidemics with rearrangements possibly providing transmissibility, immune evasion and/or virulence advantages.
A limitation of the methods used for detecting these isolates is that they may not be viable, i.e., they may be associated with molecular diagnostic detection of virus but not necessarily culture conversion, or may represent artifacts of sequencing; however, their prevalence with redundancy in various locations and processing laboratories would be consistent with human-to-human transmission. Moreover, Turakhia et al. [
134], among others, have pointed out that systematic errors associated with lab- or protocol-specific practices affect some sequences in the repositories, which are predominantly or exclusively from single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. Although we cannot rule out that such systematic errors, as well as incorrect short-read alignment, may underlie some if not all of the rearrangements detected, the possibility is rendered less likely by the geographic and temporal diversity of the isolates with each intragenomic rearrangement (as underscored by the data in the Additional file
1: legends to Figures and Table), their presence in diverse variants of concern, as well as the occurrence of rearrangements in sequences from before the pandemic era and among diverse viruses of two genera and various subgenera in at least three hosts (humans, bats, and rodents). Moreover, it is unlikely that the insertion in the nucleocapsid gene of SARS-CoV-2, which encodes a common co-mutation of adjacent sites that has been shown experimentally to have functional significance, reflects an artifactual event. Finally, when using peptides as query sequences for SARS-CoV-2, we verified that the nucleotide sequences encoding the detected peptides were identical to 5′-UTR sequences. However, we cannot rule out that the sequences detected in intragenomic rearrangements may have arisen from host cell genomes or other sources.