Main

Viruses are obligate intracellular parasites that probably infect all cellular forms of life. Although virologists have traditionally focused on viruses that cause disease in humans, domestic animals and crops, the recent advances in metagenomic sequencing, in particular high-throughput sequencing of environmental samples, have revealed a staggeringly large virome everywhere in the biosphere. At least 10^31 virus particles exist globally at any given time, and in most environments, including marine and freshwater habitats and metazoan gastrointestinal tracts, the number of detectable virus particles exceeds the number of cells by 10–100-fold1,2,3,4,5. To help conceptualize the sheer number of viruses in existence, their current biomass has been estimated to equal that of 75 million blue whales (approximately 200 million tonnes) and, if placed end to end, the collective length of their virions would span 65 galaxies6. In addition to their remarkable abundance, viruses are spectacularly diverse in the nature and organization of their genetic material, gene sequences and encoded proteins, replication mechanisms, and interactions with their cellular hosts, whether they are antagonistic, commensal or mutualistic7. Aquatic environments contain particularly diverse forms of viruses, including single-stranded (ss) and double-stranded (ds) DNA and RNA viruses with genomes that range in size from less than 2,000 bases to more than 2 million bases4. Although dsDNA viruses that infect bacteria (bacteriophages) are the best studied to date, recent work suggests that around 50% of marine viruses have ssDNA or RNA genomes8.

Metagenomic data are changing our views on virus diversity and are therefore challenging the way in which we recognize and classify viruses9. Historically, the description and classification of a new virus by the International Committee on Taxonomy of Viruses (ICTV) have required substantial information on host range, replication cycle, and the structure and properties of virus particles, which were then used to define groups of viruses. However, high-throughput sequencing and metagenomic approaches have radically changed virology, with many more viruses now known solely from sequence data than have been characterized experimentally. For example, the family Genomoviridae currently comprises a single classified virus, whereas more than 120 possible members have been sequenced from diverse environments. However, these sequenced viruses lack information about their hosts and other biological properties that would guide their assignment into species and genera in the family10. Indeed, vast numbers of complete, or nearly complete, genome sequences have been assembled and characterized from metagenomic data for viruses with small11,12,13,14, medium15,16,17,18 and even large19,20 genomes. The identification of entirely new groups of viruses from such analyses emphasizes the power of metagenomic approaches in discovering viruses, some of which could have key functions in the regulation of ecosystems, whereas others could coexist with their hosts without causing recognizable disease or may even be mutualists7. However, realistically, few of these viruses are ever likely to receive the same level of experimental characterization as pathogens that cause human disease or influence the global economy.

The question of whether viruses that are identified by metagenomics can, and should, be incorporated into the official ICTV taxonomy scheme on the basis of sequence data alone is pressing. In response to this question, a workshop of invited experts in the field of virus discovery and environmental surveillance, and members of the ICTV Executive Committee, took place in June 2016 to discuss this possibility and to develop a framework for appropriate approaches to virus classification. We present these proposals in this Consensus Statement article, together with an explanation of the rationale for their development. Our proposals have been subsequently endorsed by the ICTV Executive Committee.

Virus diversity

The discrepancy between the number of potential taxa into which viruses in environmental samples could be classified and the number currently recognized by the ICTV is striking. A recent analysis of dsDNA virus sequences that were characterized as part of the Tara Oceans expedition from 43 surface ocean sites worldwide identified 5,476 distinct dsDNA virus populations21, but only 39 of these corresponded to virus groups that have been classified by the ICTV. Most of these populations were both abundant and widely dispersed geographically, but almost all fell outside of established viral taxa (Fig. 1). Early virome studies from different marine habitats hinted at this huge diversity22,23, and, although sequencing technologies at the time precluded direct genome-wide characterization, mathematical modelling predicted several hundred thousand distinct DNA viral genotypes. A recent comprehensive metagenomic analysis of thousands of diverse samples has led to the discovery of approximately 125,000 new viral genomes and a 16-fold increase in the number of identified viral genes24. Similarly, as technology advances, it is becoming clear that ssDNA and RNA viruses in marine and other ecosystems are far more diverse than currently characterized viruses; however, these new viruses remain understudied despite their ecological importance11,25,26,27,28,29,30,31. Many ssDNA viruses identified in metagenomic data encode an evolutionarily conserved replication-associated protein (Rep), whereas the number, orientation and evolutionary origin of other genes are highly variable in these circular Rep-encoding ssDNA viruses (CRESS-DNA viruses)32. Phylogenetic analyses have revealed distinct clustering of some of these viruses into four recognized families, in addition to a vast range of viruses that fall outside of these clusters (Fig. 2). 
Aside from marine environments, most viruses discovered in wild plants through metagenomics seem to be persistent, and only a tiny proportion of these viruses are species that are recognized by the ICTV33. Highly diverse novel viruses have been similarly reported from insects34,35, and several eukaryotic and prokaryotic viruses have been identified in terrestrial environmental samples24,36.

Figure 1: Prevalence, abundance and affiliation of marine viruses.

The 15,222 virus populations that were identified across the Global Ocean Viromes (GOV) dataset69 are shown according to their prevalence (x-axis; number of sampling stations in which the population was detected) and average abundance (y-axis; log10 scale, average of normalized coverage across all samples in which the population was detected), and are coloured by the taxonomic affiliation of their host. Affiliation is based on the best basic local alignment search tool (BLAST) hit of predicted genes; a population was associated with a virus isolate and its host when ≥50% of predicted genes were affiliated to this virus isolate, and 512 of the 15,222 populations could be affiliated in this way. Figure courtesy of S. Roux and M.B.S., The Ohio State University, USA.

Figure 2: Genetic diversity of CRESS-DNA viruses.

The replication-associated protein (Rep) sequences of 659 circular Rep-encoding single-stranded DNA (ssDNA) viruses (CRESS-DNA viruses) were compared with 10 representative Rep sequences from viruses classified in the families Geminiviridae, Nanoviridae, Circoviridae and Genomoviridae, and a group of alpha satellites that are associated with geminiviruses or nanoviruses. Amino acid sequences were aligned using Multiple Alignment using Fast Fourier Transform (MAFFT; G-INS-i option)70, and a maximum likelihood phylogenetic tree was constructed using FastTree71. Branches with less than 50% SH (Shimodaira–Hasegawa)-like support were collapsed.

Metagenomic studies have also uncovered astonishingly abundant novel viruses in the human gastrointestinal tract that, despite decades of research, had not been detected previously. For example, the 97 kb genome of a dsDNA bacteriophage, named crAssphage, is six times more abundant in publicly available metagenomic datasets from sewage or wastewater samples than all other known bacteriophages combined. This virus contributes up to 90% of all sequence reads in virus-like particle-derived metagenomes and accounts for 1.7% of all human faecal metagenomic sequence reads in public databases17.

Furthermore, numerous viruses are hidden in publicly available microbial genomic datasets. A recently developed tool, VirSorter37,38, identified 12,498 new viral genome sequences in 15,000 bacterial and archaeal genomes37, which increased the number of known prokaryotic viruses 10-fold and identified viruses that infect 13 prokaryotic phyla37,38. These advances are a striking testimony to the fundamental change in virus discovery: the overwhelming majority of new viral genomes now come from metagenomic data and have never been directly linked to biological agents. Virologists, especially viral taxonomists, have no choice but to work within this new reality.

Current taxonomy of viruses

The framework that is provided by taxonomy enhances our understanding of viruses. It helps communication among virologists, and between virologists and other stakeholders, such as farmers, growers, regulators and potential funders. However, the taxonomy of viruses differs in some fundamental aspects from that of cellular life forms. In particular, viruses lack universal genes that can be used to construct a unified phylogeny into which all viruses can be placed39,40,41,42. Therefore, there is no viral equivalent to the cellular tree of life that has been established through comparisons of ribosomal RNA and (nearly) universal protein-coding genes in bacteria, archaea and eukaryotes (notwithstanding the complications that are caused by horizontal gene transfer)43,44,45.

The ICTV is solely responsible for the classification of viruses into taxa and naming them. Currently, classified viruses are assigned to the hierarchical ranks of family, genus and species, and each taxon has a defined, unique and regulated name. Some families are also divided into subfamilies that each contain separate genera, and a minority of families are also assigned to the higher taxon of order. The ICTV disseminates information on virus taxonomy through the master species list (MSL), which currently lists 7 orders, 112 families, 610 genera and 3,704 species46 (see Virus Taxonomy: 2015 Release), and through periodic publication of ICTV reports that contain additional descriptive material47. The MSL is updated annually based on the submission of taxonomic proposals to the ICTV Executive Committee (see current ICTV Executive Committee webpage), mostly by specialized study groups (see ICTV Study Groups). These proposals are made available to the public and are then scrutinized by the ICTV Executive Committee for compliance with a minimal set of rules that are laid out in the International Code of Virus Classification and Nomenclature (ICVCN; see International Code of Virus Classification and Nomenclature webpage), and for the robustness of the supporting evidence. The new taxonomy is then ratified by voting members of the ICTV and incorporated into the MSL annually.

The lowest taxonomic rank is that of species, which is defined in the ICVCN as “a monophyletic group of viruses whose properties can be distinguished from those of other species by multiple criteria”. Historically, the term “multiple criteria” has been interpreted as referring to attributes such as replication properties in cell culture, virion morphology, serology, nucleic acid sequence, host range, pathogenicity, and epidemiology or epizootiology. However, there is considerable variation in the way in which these criteria have been applied to viruses in different families by the respective Study Groups and approved by the ICTV.

The ICVCN provides greater freedom for specifying the higher taxonomic ranks, with a genus defined as “a group of species sharing certain common characters”, a family defined as “a group of genera (whether or not these are organized into subfamilies) sharing certain common characters” and an order defined as “a group of families sharing certain common characters”. These looser criteria accommodate the substantial variation in the way in which they are applied among the higher ranks. As an approximate guide for vertebrate and plant viruses, members of different genera in a family typically have similar genome organizations with homologous structural and replication-associated genes, but often have non-homologous accessory genes, such as those that are involved in the evasion of host defence and in viral movement in plants. By contrast, between families, viruses often have completely different genome organizations and may lack any detectable genetic relatedness. The presence of homologous, even if not closely similar, RNA-dependent RNA polymerases (RdRps), proteases and helicases in RNA viruses, and Rep-encoding genes in small ssDNA viruses, may, however, enable distant evolutionary relationships between virus families to be identified; such relationships may form a basis for the creation of orders. The process of identifying such distant relationships and assessing their appropriateness for higher rank taxonomic classification is not trivial, and, consequently, the creation of orders requires particularly careful consideration. For example, the existence of a substantial set of shared genes in diverse large or giant dsDNA viruses of eukaryotes has prompted a proposal for the creation of the order 'Megavirales' (Ref. 48), which has thus far not been accepted by the ICTV owing to the lack of consensus in the field. Similarly, the creation of an order for the CRESS-DNA viruses is currently being considered by the relevant ICTV Study Groups.

Virus taxonomy in the age of metagenomics

In the past, the approval of a new species by the ICTV was typically dependent on the availability of data that demonstrate the distinct biological characteristics of the respective virus. This requirement has limited the number of viruses that have been classified and incorporated into the MSL. As most viruses are now discovered by metagenomics and lack direct correlation with biological agents, a workshop was convened to develop a new framework for virus taxonomy in the era of metagenomics (Box 1; Supplementary information S1 (box)). The discussions at the workshop reflected the fact that the challenges that are posed by metagenomic data are not unique to viruses (Box 2).

Sequence assemblies that are derived from environmental samples often contain complete, verified genome sequences of new viruses, but do not directly provide information on biological properties. This perceived limitation has raised the concern that virus classification based on sequence information alone would result in a taxonomy of sequences rather than of viruses49. However, with appropriate precautions (see below), we believe that the detection of a viral sequence in a sample is sufficient evidence to infer the existence of the corresponding virus. Indeed, the concept that a virus can be detected, characterized and classified entirely through analysis of its sequence has gained traction in the burgeoning field of virus discovery. Given that the properties of a virus are largely, or entirely, encoded by its genome, it follows that virus classification based on sequence information alone is not limited primarily by the absence of biological attributes, but by our inability to accurately read such information and robustly infer enzymatic functions, virion structure and other phenotypic attributes.

Sequence data provide a wealth of information that can be used for the purposes of taxonomy, such as evolutionary relationships, overall genome organization (gene content and order, prediction of encoded proteins and the presence of characteristic repeated sequences), features of genome expression, genome replication strategy, the presence or absence of various distinctive motifs (for example, polyprotein cleavage sites, internal ribosome entry sites, terminal sequences, structural folds and host range determinants50), and features of global and local genome composition (for example, GC content, dinucleotide frequencies51 and codon usage). Sequence analyses could thus provide the 'multiple criteria' that are required for classification into species. Indeed, the successful use of sequence information in virus classification has been foreshadowed in the pre-metagenomic era. For example, the bioinformatic characterization of cloned sequences was responsible for the discovery of hepatitis C virus, the prediction of its properties and replication strategy, the characterization of its similarity to members of the family Flaviviridae, and the development of effective diagnostic and screening assays52,53; such advances preceded the visualization of virus particles, the detection of viral proteins in vivo and the achievement of viral growth in cell culture by many years.
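
As a concrete illustration of the genome composition features mentioned above, the following minimal Python sketch computes GC content, dinucleotide frequencies and codon counts from a nucleotide string. The function names are hypothetical and chosen only for this example; the sketch assumes an unambiguous, single-stranded sequence and, for codon counting, an in-frame coding sequence. It is not a method prescribed by the ICTV or by the studies cited here.

```python
from collections import Counter

BASES = set("ACGT")


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (0.0 for an empty sequence)."""
    seq = seq.upper()
    counted = sum(1 for base in seq if base in BASES)
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def dinucleotide_frequencies(seq):
    """Relative frequency of each overlapping dinucleotide, ignoring ambiguous bases."""
    seq = seq.upper()
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    counts = Counter(p for p in pairs if set(p) <= BASES)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def codon_usage(cds):
    """Count codons in a coding sequence, assumed to start in frame.

    Trailing bases that do not form a complete codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))
```

Feature vectors of this kind could, in principle, be compared between a new metagenomic sequence and members of established taxa, although any thresholds for such comparisons would need to be set by the relevant Study Group.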

However, it is important to recognize that there are several technical problems with using viral genomes that are assembled from metagenomic datasets for taxonomy. Such sequences are often derived from mixed virus populations and, consequently, there is a risk of assembling artificially chimeric genomes. Furthermore, current methodologies are unsuitable for assembling complete genome sequences from viruses that have segmented or multipartite genomes. Another practical problem arises from virus-derived sequences that are integrated into host genomes (for example, endogenous virus-like elements and prophages), many of which are transcribed and hence are present in the RNA pool. To use metagenomic sequences for classification, these problems need to be addressed by robust computational and experimental methods. However, these caveats do not represent fundamental barriers to virus classification, as the technology that is used to create metagenomic sequences is improving continuously, and many of the problems, particularly those that are associated with de novo assembly, will be resolved. These improvements include methods that generate longer sequence reads and those that use template circularization to decrease error rates54.

Proposals

The workshop reached a consensus view on classifying viruses solely on the basis of metagenomic sequence data and, consequently, developed a set of proposals (Box 1; Supplementary information S1 (box)). These proposals are diagrammatically summarized in Fig. 3.

Figure 3: Summary of the proposed classification pipeline.

The proposed classification pipeline (red arrows) enables both metagenomic sequence data and conventionally derived virus sequences to be classified. Inferred biological properties that are obtained by bioinformatic analysis of virus sequences together with information on sequence relatedness and gene content, and, optionally, any observed biological properties (dotted line), may all be used as defining criteria for species and higher rank taxonomic assignment in the International Committee on Taxonomy of Viruses (ICTV) taxonomy. This procedure differs from current (green arrows) and previous practice (blue arrows), in which biological data and/or host information and sequence data (current), or biological data alone (1970s–1990s), were required for classification.

Basis of classification. Classifying viruses that are identified only from metagenomic data will advance virus taxonomy, dependent on appropriate checks on data integrity and following the standard procedures of assignment. This is expected to involve the creation of higher rank taxa that consist entirely of viruses that are identified from metagenomic sequence data.

Creating new species. The current ICTV species definition suffices for the classification of viruses based only on sequence information. Virus characteristics that can be inferred from sequence data, including genome organization, replication strategy, presence of homologous genes, and, potentially, host range or type of vector, may serve as additional biological characteristics. These may be used to delineate species in the absence of phenotypic data that have often been relied on for existing species definitions. Such information is best inferred from genomic sequences that comprise the complete coding potential of the respective virus and should be a minimum requirement for classification based on sequences alone.

Assigning new species and genera to existing families. Demarcation procedures vary widely between virus groups and are typically based on parameters that include sequence-based phylogeny and various biological attributes. Although recognizing that direct biological information may form a part of the definition of existing taxa, viruses that are identified from metagenomic data can be classified into additional taxa (species and genera) if their sequence relationships are comparable to those among existing taxa in that family.

Delineating new families and orders. Viruses that have genome sequences that lack close relationships to viruses in existing taxa pose a particular problem, as there is no phenotypically derived standard by which they can be classified. In this situation, assignment of a virus to a new family could be based on limited or absent genetic homology to viruses in recognized families and the existence of major differences in genome organization or inferred replication strategy. Clustering and patterns of variation among more closely related metagenomic sequences might be used to assign viruses hierarchically to lower taxonomic ranks in such groups. However, the creation of a new family, and the assignment of genera and species within it, would require a considerable amount of sequence information and the development of a sound classification framework that is capable of accommodating it. Formalized clustering and network analysis methods that create similarity metrics that are based on the detection of homologous genes and their genetic divergence55,56,57 could be valuable for taxonomic assignments and should be critically evaluated for their effectiveness in the development of a robust classification approach. Frameworks of this kind may have to be tailored to the virus group. For example, bacteriophage taxonomy is typically based on virion sequence and structure58, but these characteristics may not be appropriate for the classification of animal and plant RNA viruses, in which deeper relationships are most often apparent in the gene sequences of the RNA polymerase and other conserved replication-associated proteins59.
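
To make the idea of a gene-sharing similarity metric more tangible, the sketch below groups genomes by the Jaccard index of their sets of homologous gene families and then takes connected components of the resulting network. This is a deliberately simplified stand-in, not the approach of the cited clustering and network methods55,56,57: the genome names, gene-family labels and the 0.5 threshold are all hypothetical, and the sketch assumes that homologous genes have already been assigned to families by some upstream step.

```python
from itertools import combinations


def jaccard(a, b):
    """Shared gene families divided by all gene families found in either genome."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_shared_genes(genomes, threshold):
    """Group genomes into connected components of a gene-sharing network.

    genomes maps a genome name to its set of gene-family identifiers; two
    genomes are linked when their Jaccard index is at least `threshold`.
    """
    neighbours = {name: set() for name in genomes}
    for x, y in combinations(genomes, 2):
        if jaccard(genomes[x], genomes[y]) >= threshold:
            neighbours[x].add(y)
            neighbours[y].add(x)

    clusters = []
    seen = set()
    for start in genomes:
        if start in seen:
            continue
        # Depth-first walk that collects every genome reachable from `start`.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbours[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Hypothetical input: three genomes annotated with made-up gene-family labels.
example_genomes = {
    "virus_A": {"rep", "cap", "orf3"},
    "virus_B": {"rep", "cap"},
    "virus_C": {"pol", "hel"},
}
example_clusters = cluster_by_shared_genes(example_genomes, threshold=0.5)
```

With these made-up inputs, virus_A and virus_B share two of three gene families and fall into one group, whereas virus_C shares none and remains on its own. Real frameworks would also need to weigh genetic divergence within shared genes, and the choice of threshold is exactly the kind of parameter that should be critically evaluated for each virus group.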

Nomenclature of taxa identified only from sequence data. The system that is currently used by the ICTV for taxon nomenclature is readily extendable to additional species, genera and families that are created from metagenomic sequence data. Furthermore, taxa may contain viruses that were identified by various methods. Hence, a species that initially comprises viruses that are characterized solely from sequence data could eventually include viruses that are identified by isolation and that have directly defined biological properties. Thus, metagenomic status belongs to, and would be recoverable from, the sequence record for a particular virus and not to the entire taxon to which it is assigned. Although some virologists have adopted the term 'associated' as part of the nomenclature of viruses that were identified in metagenomics datasets (for example, human stool-associated circular virus (GQ404856 (Ref. 60)); for other examples see Refs 12,13,26,61), it is unnecessary to incorporate this or other such terms that are equivalent to the bacterial term 'Candidatus' into virus taxon names.

Improvement of the procedure for the classification of viruses. The current process of submitting taxonomic proposals to the ICTV suffices, in principle, for dealing with viruses that are known only from sequence data. However, the process could be substantially improved and streamlined through the development of electronic submission methods that incorporate appropriate quality checks for accuracy and completeness of data. In particular, the format could be modified to enable numerous species (possibly many hundreds or thousands) to be proposed in the same submission without the unnecessary repetition of information. In addition, procedures could be developed that shorten the time that is required for processing proposals and updating the MSL.

ICTV endorsement. As an important initial step towards metagenomics-based virus taxonomy, the proposals that were developed during the workshop were presented to, and discussed at, the ICTV Executive Committee meeting on 22–24 August 2016. The proposals were supported by all members of the Executive Committee who were present (one member was unavoidably absent but has since expressed support), and their practical implementation was seen as a matter of high priority for the ICTV. This process will include actively inviting the virology community to submit taxonomic proposals that are based on metagenomic sequences, providing guidelines on data standards (including sequence quality and completeness) and developing more effective data submission tools for large sequence datasets. The ICTV Executive Committee plans to explain and develop these steps in a separate article.

Conclusions

We believe that the time has come to advance the philosophy and practice of virus taxonomy by admitting viruses that are identified only from metagenomics data as being bona fide viruses, dependent on appropriate checks on data integrity and following the standard procedures of taxonomic assignment. We expect that this process will lead to the imminent creation of higher rank taxa that consist entirely of viruses identified by metagenomics.

We believe that the implementation of the proposals outlined here will enable the creation of a vastly expanded formal taxonomy for viruses that will be a major contribution to future research on virus diversity. Only by accepting that sequences that are generated by metagenomic methods truly represent existing viruses and by including them in classification schemes, can we hope to better understand the ecology, history and impact of the global virome.