Background
Coronaviruses (CoVs) cause respiratory and intestinal infections in animals and humans [1]. They were not considered to be highly pathogenic to humans until the last two decades, which have seen three outbreaks of highly transmissible and pathogenic coronaviruses: SARS-CoV (severe acute respiratory syndrome coronavirus), MERS-CoV (Middle East respiratory syndrome coronavirus), and SARS-CoV-2 (which causes the disease COVID-19). Other human coronaviruses (such as HCoV-NL63, HCoV-229E, HCoV-OC43 or HKU1) generally induce only mild upper respiratory diseases in immunocompetent hosts, although some may cause severe infections in infants, young children and elderly individuals [1].
Extensive studies of human coronaviruses have led to a better understanding of coronavirus biology. Coronaviruses belong to the family Coronaviridae in the order Nidovirales. Whereas MERS-CoV is a member of the Merbecovirus subgenus, phylogenetic analyses indicated that SARS-CoV-2 clusters with SARS-CoV in the Sarbecovirus subgenus [2]. All human coronaviruses are considered to have animal origins. SARS-CoV, MERS-CoV and SARS-CoV-2 are assumed to have originated in bats [1]. It is widely believed that SARS-CoV and SARS-CoV-2 were transmitted directly to humans from market civets and pangolins, respectively, based on sequence analyses of CoVs isolated from these animals and from infected patients.
All members of the coronavirus family are enveloped viruses that possess long positive-sense, single-stranded RNA genomes ranging in size from 27 to 33 kb. The coronavirus genomes encode five major open reading frames (ORFs), including a 5′ frameshifted polyprotein (ORF1a/ORF1ab) and four canonical 3′ structural proteins, namely the spike (S), envelope (E), membrane (M) and nucleocapsid (N) proteins, which are common to all coronaviruses [3]. In addition, a number of subgroup-specific accessory genes are found interspersed among, or even overlapping, the structural genes. Overlapping genes originate by a mechanism of overprinting, in which nucleotide substitutions in a pre-existing frame induce the expression of a novel protein in an alternative frame. The accessory proteins in coronaviruses vary in number, location and size in the different viral subgroups, and are thought to provide additional functions that are often not required for virus replication, but are involved in pathogenicity in the natural host [4, 5].
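To make the notion of overlapping reading frames concrete, the following minimal sketch translates one nucleotide string in each of the three forward frames using the standard genetic code. It is an illustration only and not part of the analyses reported here; the function name `translate` and the toy sequence are assumptions introduced for this example.

```python
# Standard genetic code, built from the conventional T, C, A, G ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame=0):
    """Translate seq starting at offset `frame` (0, 1 or 2).

    Stop codons are written as '*'; codons containing characters other
    than A, C, G, T are written as 'X'. Incomplete trailing codons are ignored.
    """
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    toy = "ATGGCC"  # hypothetical toy sequence, not taken from any genome
    for frame in range(3):
        print(frame, translate(toy, frame))
```

The same nucleotides give a different peptide in each frame, which is why a substitution that is silent in one frame can create or destroy a start or stop codon in an overlapping frame.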
In the face of the ongoing COVID-19 pandemic, extensive worldwide research efforts have focused on identifying coronavirus genetic variation and selection [6–8], in order to understand the emergence of host/tissue specificities and to help develop efficient prevention and treatment strategies. These studies are complemented by structural genomics [9–11], as well as transcriptomics [12] and interactomics studies [13] of the structural and putative accessory proteins.
However, there have been fewer studies of accessory proteins, for two main reasons [14]. First, accessory proteins are often not essential for viral replication or structure, but play a role in viral pathogenicity or spread, for example by modulating the host interferon signaling pathways. This has led to some contradictory experimental results concerning the presence or functionality of accessory proteins. For instance, in a recent experiment [13] to characterize SARS-CoV-2 gene functions, 9 predicted accessory protein ORFs (3a, 3b, 6, 7a, 7b, 8, 9b, 9c, 10) were codon optimized and successfully expressed in human cells, with the exception of ORF3b. However, another recent study using DNA nanoball sequencing [12] concluded that SARS-CoV-2 expresses only five canonical accessory ORFs (3a, 6, 7a, 7b, 8).
Second, bioinformatics approaches for the prediction of accessory proteins are challenged by their complex nature as short, overlapping ORFs. Such proteins are known to have biased amino acid sequences compared to non-overlapping proteins [15]. In addition, the homology-based approaches widely used to predict ORFs in genomes are less useful here, because many accessory proteins are lineage- or subgroup-specific. Thus, many state-of-the-art viral genome annotation systems, such as Vgas [16], only predict overlapping proteins if homology information is available. Other methods have been developed specifically for the ab initio prediction of overlapping genes, for example based on multiple sequence alignments and statistical estimates of the degree of variability at synonymous sites [17], or on sequence simulations and calculation of expected ORF lengths [18].
Here, we propose a computational tool, GOFIX (Gene prediction by Open reading Frame Identification using X motifs), to predict potential ORFs in virus genomes. Using a complete viral genome as input, GOFIX first locates all potential ORFs, defined as regions delineated by start and stop codons. In order to predict functional ORFs, GOFIX calculates the enrichment of the ORFs in X motifs, i.e. motifs of the X circular code [19], a set of 20 codons that are over-represented in the reading frames of genes from a wide range of organisms. For example, in a study of 299,401 genes from 5217 viruses [20], including double-stranded and single-stranded DNA and RNA viruses, codons of the X circular code were found to occur preferentially in the reading frame of the genes. This is an important property of viral genes, since it has been suggested that X motifs at different locations in a gene may assist the ribosome to maintain and synchronize the reading frame [21]. An initial evaluation of the GOFIX method on a large set of 80 virus genomes [15] showed that it achieves high sensitivity and specificity for the prediction of experimentally verified overlapping proteins (manuscript in preparation). A major advantage of our approach is that it requires only the sequence of the studied genome and does not rely on any homology information. This allows us to detect novel ORFs that are specific to a given lineage.
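The two steps described above (locating ORFs, then measuring their content in X motifs) can be illustrated with a short sketch. This is not the GOFIX implementation and does not reproduce its scoring: the 20 codons listed are the X circular code as reported in the circular code literature, whereas the function names, the `min_codons` and `min_len` thresholds, the restriction to ATG start codons on the forward strand, and the simple coverage ratio are all assumptions made for this example.

```python
# The 20 codons of the X circular code.
X_CODE = frozenset(
    "AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
    "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split()
)
STOPS = frozenset({"TAA", "TAG", "TGA"})


def find_orfs(genome, min_codons=10):
    """Return (start, end, frame) for ATG..stop ORFs on the forward strand.

    `end` is exclusive and includes the stop codon. For each stop codon,
    only the first in-frame ATG after the previous stop is used.
    `min_codons` (hypothetical threshold) counts codons before the stop.
    """
    genome = genome.upper().replace("U", "T")
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(genome) - 2, 3):
            codon = genome[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def x_motif_coverage(orf_seq, min_len=3):
    """Fraction of in-frame codons lying in runs of >= min_len X codons.

    A run of consecutive X codons stands in for an 'X motif' here; the
    minimum run length is an assumed parameter, not the published one.
    """
    codons = [orf_seq[i:i + 3] for i in range(0, len(orf_seq) - 2, 3)]
    covered = run = 0
    for codon in codons + [None]:  # the sentinel flushes the final run
        if codon in X_CODE:
            run += 1
        else:
            if run >= min_len:
                covered += run
            run = 0
    return covered / len(codons) if codons else 0.0


if __name__ == "__main__":
    toy = "CCATGGAAGACGAGTGGTGGGTCTAA"  # hypothetical toy genome
    for start, end, frame in find_orfs(toy, min_codons=5):
        print(start, end, frame, x_motif_coverage(toy[start:end]))
```

Because the score is computed from the candidate ORF sequence alone, a sketch of this kind needs no homologous sequences, which mirrors the stated advantage of the approach for lineage-specific ORFs.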
We applied GOFIX to study the SARS-CoV-2 genome and related SARS genomes, with a main focus on the accessory proteins. Using the extensive experimental data concerning the SARS-CoV genome and the expressed ORFs, we first show that the reading frames of the SARS-CoV ORFs are enriched in X motifs, including most of the overlapping accessory proteins. Exceptions include SARS-CoV ORF3b and ORF8b which may not be functional. Then, we use GOFIX to predict and compare putative genes in related genomes of SARS-like viruses from bat, civet and pangolin hosts as well as human SARS-CoV-2.
Discussion
Coronaviruses have complex genomes with high plasticity in terms of gene content. This feature is thought to contribute to their ability to adapt to specific hosts and to facilitate host shifts [1]. It is therefore essential to characterize the coding potential of coronavirus genomes. Here, we used an ab initio approach to identify potential functional ORFs in the genomes of a set of representative SARS or SARS-like coronaviruses. Our method allows comprehensive annotation of all ORFs. Surprisingly, the calculation of X motif enrichment is also accurate for the detection of overlapping genes, even though the codon usage and amino acid composition of overlapping genes are known to be significantly different from those of non-overlapping genes [15].
We showed that the predictions made by the GOFIX method have high sensitivity and specificity compared to the known functional ORFs in the well-characterized SARS-CoV. For example, the annotated ORFs that have been described previously as non-functional or redundant, notably ORF3b and ORF8b, are not predicted to be functional by GOFIX. In contrast, we identified a putative small ORF overlapping the RBD of the Spike protein in SARS-CoV that is conserved in Civet-CoV and Bat-CoV strain WIV16. Protein sequence analysis predicts that this novel ORF codes for a double-membrane spanning protein.
We then used the GOFIX method to compare all putative ORFs in representative genomes, and showed that most are conserved in all genomes, including the structural proteins (S, E, M and N) and accessory proteins 3a, 6, 7a, 7b, 9b and 9c. However, a number of ORFs were predicted to be non-functional, notably ORF8b in SARS-CoV and ORF3b in all genomes. We also identified potential new ORFs, including ORF9d in Pangolin-CoV and ORF10 in all genomes.
To date, the coding potential of SARS-CoV-2 remains partially unknown, and distinct studies have provided different genome annotations [37–39]. Overall, the genome of SARS-CoV-2 has 89% nucleotide identity with that of bat SARS-like-CoV (ZXC21) and 82% with that of human SARS-CoV [40]. A recent annotation [39] of the SARS-CoV and SARS-CoV-2 genomes identified 380 amino acid substitutions in 27 shared proteins, including the four structural proteins and eight accessory proteins, named 3a, 3b, p6, 7a, 7b, 8b, 9b and orf14 (corresponding to ORF9c here). Our analysis is in agreement with the previous studies showing that the genome organization is generally conserved. In particular, ORF9b and ORF9c are predicted to be expressed in the SARS-CoV-2 genome. As expected, the structural proteins S, E, M and N are conserved and have similar XME scores. ORF3a, ORF6 and ORF9b in SARS-CoV-2 also have XME scores similar to those in SARS-CoV.
Our ab initio analysis also allowed us to highlight some important specificities of the SARS-CoV-2 genome. Previously identified differences include some interferon antagonists and inflammasome activators encoded by SARS-CoV that are not conserved in SARS-CoV-2, in particular ORF8 in SARS-CoV-2 and ORF8a,b in SARS-CoV. Recent annotations of ORF3b are conflicting. For example, some authors have predicted that SARS-CoV-2 ORF3b is homologous to SARS-CoV ORF3b [40], although the proposed SARS-CoV-2 protein is shorter with only 22 amino acids. In contrast, the SARS-CoV-2 ORF3b observed in [13] is not coded by the same region as SARS-CoV-2 ORF3b and the protein sequence is completely different. Here, we show that ORF3b has 0 X motifs in SARS-CoV-2, in agreement with the fact that little expression was observed in recent experiments aimed at characterizing the functions of SARS-CoV-2 proteins [13]. ORF10 is thought to be unique to SARS-CoV-2; however, it is also present in the Pangolin-CoV genome and its origin can be traced back to the Bat-CoV, where a truncated ORF of 26 amino acids, also present in the civet and human SARS-CoV genomes, can be found. Here, we observe that ORF7a, ORF7b and ORF9c have reduced XME scores in SARS-CoV-2. It remains to be seen whether these differences reflect functional divergences between SARS-CoV and SARS-CoV-2.