Background
The last several years have witnessed major advances in our understanding of the diversity of the global virome (the entirety of viruses in the biosphere) and the evolutionary relationships among viruses and virus-like mobile genetic elements (MGE). Recently, the principal methodology for the discovery of new viruses has changed dramatically: instead of traditional virus isolation and cultivation, the great majority of new viruses are now discovered through metagenomic sequence analysis [
1‐
7]. Some of these studies have drastically changed the existing knowledge of virome composition in various habitats. One of the most striking cases in point is the discovery of crAssphage, by far the most abundant virus in the human intestine and, more generally, in the human-associated virome, which remained unknown until the advent of metagenomics [
8‐
10]. At the time of the initial discovery, most of the crAssphage genes remained uncharacterized, and no related viruses had been identified, so that, despite its ubiquity and high abundance in humans, this virus remained completely enigmatic and recalcitrant to further investigation [
9]. However, a follow-up study taking advantage of expanded databases and sensitive homology detection methods led to the identification of an expansive family of bacteriophages, all of which are predicted to infect bacteria of the phylum Bacteroidetes [
11]. The main structural and replication genes of these viruses have been identified, making them amenable to experimental characterization.
Metagenomic sequence mining has led to the discovery of previously unrecognized groups of viruses that apparently infect uncultivated bacteria and archaea and are likely to be important ecological players. An example is a novel family of viruses associated with uncultivated Group II marine archaea, where both the hosts and the viruses appear to be among the most common in the ocean, yet would have remained obscure without the metagenomic effort [
12,
13]. Other metagenomic studies have drastically changed the status of certain groups of viruses that had previously been considered minor components of the virosphere. In particular, metagenomic analyses have revealed an enormous, unsuspected diversity and abundance of single-stranded (ss) DNA viruses [
14‐
18]. These are only a few of the metagenomic discoveries that collectively indicate that traditional methods for virus isolation have only scratched the surface of the virosphere, whereas the actual diversity and structure of the global virome can be characterized only by comprehensive metagenomic analyses. In recognition of this sea change in virus research, the International Committee on Taxonomy of Viruses (ICTV) is now accepting proposals for new virus species and higher taxa on the basis of metagenomic sequences alone [
7].
In parallel with the advances in metagenomics and, in large part, fueled by metagenomic discoveries, there has been considerable progress in the reconstruction of virus evolution. The major emerging trend is the ultimate modularity of virus evolution, whereby evolutionarily coherent structural and replication modules combine promiscuously with one another and with a variety of additional genes [
19‐
21]. One of the most notable cases in point is that of the ssDNA viruses, which appear to have evolved on multiple occasions via independent recombination events between a capsid protein gene derived from RNA viruses and a plasmid replication module [
20,
22,
23]. A complete reconstruction of virus evolution is feasible only through phylogenomic analysis of both the structural and the replication modules, along with the recombination events [
24]. In practice, the structural module is often the best marker of virus evolution because the structural genes seem to be exchanged or eliminated less often than replication genes, and hence provide for unification of more diverse groups of viruses [
25,
26].
The great majority of the double-stranded (ds) DNA viruses with moderate-sized and large genomes can be partitioned into two vast supergroups with distinct, unrelated structural modules [
21]. The robustness of the two groups has been validated quantitatively by analysis of bipartite gene-genome networks [
27]. The first supergroup includes most of the known head-tail bacteriophages (order
Caudovirales), a variety of phage-like viruses infecting mesophilic archaea, and the animal viruses of the order
Herpesvirales. All these viruses possess icosahedral particles formed by the so-called HK97-fold capsid protein (named after the eponymous bacteriophage) and a two-subunit terminase that mediates ATP-dependent DNA packaging into the capsid. The second supergroup consists of two families of bacteriophages (
Tectiviridae and
Corticoviridae) [
28], archaeal viruses of the family
Turriviridae [
29] and many diverse groups of eukaryotic viruses including giant eukaryotic viruses of the putative order “Megavirales” [
30]. All these viruses also possess icosahedral capsids that, however, are built of the double jelly-roll major capsid protein (DJR MCP [
31,
32]), which is unrelated to the HK97 capsid protein and is typically accompanied by a single jelly-roll minor capsid protein. Furthermore, for DNA packaging, these viruses employ a distinct ATPase that belongs to the FtsK-HerA superfamily of P-loop NTPases [
33] and is unrelated to the terminase.
The two major supergroups of dsDNA viruses differ strongly with respect to the representation of viruses infecting prokaryotes and eukaryotes. The HK97 supergroup consists primarily of prokaryotic viruses, the tailed phages, which represent a substantial majority among all known viruses. In this supergroup, viruses of eukaryotes are represented by a single, albeit expansive, order, Herpesvirales, with representatives so far detected only in animals. Conversely, viruses of the DJR MCP supergroup have attained remarkable diversity in eukaryotes but are only sparsely represented among the known viruses of prokaryotes. We sought to explore the actual expanse of the DJR MCP group among prokaryotes by searching genomic and metagenomic databases for homologs of the tectivirus-like MCP using sensitive sequence analysis methods. In genomes and metagenomes from various environments, we discovered numerous, highly diverse DJR MCP-encoding sequences in variable genomic contexts. Analysis of these sequences revealed several groups of previously unknown viruses and proviruses that show extreme plasticity of gene repertoires and genome organizations.
Discussion
The results of this work are not unexpected in the sense that they are fully compatible with the notion of accelerating expansion of the virosphere thanks to genome and metagenome mining efforts [
76]. It seems that, with the advances of metagenomics, an exhaustive search for distant relatives of any known group of viruses or for completely new groups is bound to reveal previously unsuspected diversity. Here we expanded the previously limited diversity and host range of small dsDNA (and in the case of the FLiP group, ssDNA) viruses of prokaryotes with icosahedral virions composed of DJR MCP. These findings restore the balance between viruses of prokaryotes and eukaryotes in the DJR MCP supergroup by demonstrating the wide spread of these viruses in prokaryotes. Although these viruses appear to be less abundant than those of the HK97 supergroup, their diversity, association with various hosts and presence in many environments revealed by the present analysis suggest that they comprise a substantial component of the prokaryotic virosphere and might be important ecological agents. While this manuscript was in review, a study describing a new family of tailless bacteriophages encoding a DJR MCP and denoted “Autolykiviridae” was published [
77]. The autolykiviruses have been identified as the principal killers of the Vibrionaceae bacteria in the ocean, indicating a much greater ecological impact of tailless bacteriophages than previously suspected. Notably, all members of this putative new family belong to the PM2 group, one of the six groups of DJR MCP viruses of prokaryotes described here. In the phylogenetic tree of the MCP for the PM2 group, the autolykiviruses form one tight clade among several (Additional file
3), emphasizing the diversity of the prokaryotic DJR MCP viruses, of which the new family is but a small part. Taken together, these findings reveal an unexpectedly wide spread of the DJR MCP class of viruses in the biosphere, the full extent of which remains to be assessed.
The search for new viruses described here was performed using a straightforward approach, namely, searching the genomic and metagenomic databases for MCP homologs. It should be emphasized, however, that using the most sensitive and specific of the available database search approaches is critical for the success of such efforts. The majority of the putative viruses and proviruses that we describe here could be identified only when a manually curated alignment of MCP was used as the query for the database search. Furthermore, many other proteins of the identified viruses are also highly diverged, making manual curation a must for robust genome analysis.
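To illustrate why an alignment can serve as a more sensitive query than a single sequence, the following is a minimal sketch of a position-specific scoring profile built from a multiple alignment and slid along candidate sequences. It is a toy under stated assumptions, not the search procedure used in this work: the function names (`build_profile`, `best_window_score`), the uniform background frequencies, the pseudocount, and the short example sequences are all hypothetical, and the scan is ungapped. Real profile-based search tools additionally handle gaps, sequence weighting and statistical significance.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length aligned sequences.

    Assumption: uniform background frequencies, a deliberate simplification.
    Gap or unknown characters in the alignment are ignored when counting.
    """
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_window_score(profile, sequence):
    """Return (score, start) of the best ungapped match of the profile.

    Residues outside the 20 standard amino acids contribute a score of 0.
    If the sequence is shorter than the profile, (-inf, -1) is returned.
    """
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(aa, 0.0) for col, aa in zip(profile, window))
        best = max(best, (score, start))
    return best


# Hypothetical toy alignment and candidates, invented purely for illustration.
toy_alignment = ["MKTAY", "MKSAY", "MRTAF"]
toy_profile = build_profile(toy_alignment)
hit_score, hit_start = best_window_score(toy_profile, "GGMKTAYGG")
miss_score, _ = best_window_score(toy_profile, "GGGGGGGGG")
```

In this toy case the candidate containing the conserved motif scores well above the unrelated one, because the profile rewards residues that recur in each alignment column rather than requiring identity to any single query sequence.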
The putative viruses identified here with the single MCP probe show consistency in terms of genome size: all have some of the smallest genomes among dsDNA viruses, roughly between 7 and 18 kb. These elements also encode varying assortments of proteins from a characteristic virus gene pool, the most prominent being the packaging ATPase of the FtsK-HerA superfamily. This consistency of genomic features suggests that the prokaryotic DJR MCP-encoding viruses occupy a distinct part of the virosphere. However, the other side of the coin is the extreme plasticity of the gene repertoires of these viruses. Apart from the MCP, which is present by design of our search protocol, not a single gene is shared by all genomes. Instead, within each of the defined virus groups, proteins from the replication, integration and lysis modules are recurrently replaced with functionally equivalent counterparts. Even the packaging ATPase, one of the most stable functional partners of the DJR MCP, is missing in the putative viruses of the Odin and FLiP groups, suggesting a distinct packaging mechanism. The viruses of the newly discovered Odin group have the smallest genomes in the DJR MCP supergroup, comparable in size to the genomes of the smallest dsDNA viruses of eukaryotes (polyoma/papillomaviruses) and ssDNA viruses with single jelly-roll MCPs that lack a dedicated virus-encoded packaging ATPase. Many of these viruses appear to assemble the capsid around the viral genome [
78,
79] instead of packaging the DNA into a preformed, empty capsid in an ATP-dependent fashion as dsDNA viruses with larger genomes do [
80]. A similar mechanism might be operative in the viruses of the Odin group as well as the FLiP group. More generally, the present findings emphasize the enormous evolutionary plasticity of viruses that can completely change the gene repertoire while retaining the same capsid structure and similar genome size. Parallel findings have been reported previously as a result of a search for eukaryotic DJR MCP viruses resembling the polinton class transposons [
72,
81], suggesting that such plasticity is a general, still under-appreciated trend in the evolution of the virus world. We hope that the present analysis will stimulate experimental characterization of some of the viruses identified here, which should shed light on their biology.
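The observation above that no gene other than the MCP is shared by all genomes amounts to a core-gene computation on a gene presence/absence table. A minimal sketch follows, assuming gene repertoires are available as sets of gene family labels per genome. The genome names and repertoires are hypothetical placeholders invented for illustration and are not data from this study.

```python
def core_genes(repertoires):
    """Return the set of gene families present in every genome."""
    sets = [set(genes) for genes in repertoires.values()]
    return set.intersection(*sets) if sets else set()


def gene_prevalence(repertoires):
    """Return the fraction of genomes that carry each gene family."""
    n_genomes = len(repertoires)
    counts = {}
    for genes in repertoires.values():
        for gene in set(genes):
            counts[gene] = counts.get(gene, 0) + 1
    return {gene: count / n_genomes for gene, count in counts.items()}


# Hypothetical repertoires: placeholder genomes, not results of this work.
toy_repertoires = {
    "genome_A": {"MCP", "packaging_ATPase", "integrase"},
    "genome_B": {"MCP", "packaging_ATPase", "replication_protein"},
    "genome_C": {"MCP", "lysis_protein"},
}

shared = core_genes(toy_repertoires)
prevalence = gene_prevalence(toy_repertoires)
```

With these placeholder inputs, only the gene used as the search probe is in the core set, while every other gene family has a prevalence below 1, which is the pattern of plasticity described in the text.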