Background
We have reported the DNA sequence and genomic annotation of a novel large-genome bacteriophage named Bacillus thuringiensis phage 0305φ8-36 [1, 2]. Phage 0305φ8-36 was isolated from soil while targeting the isolation of large, unusual phages of unsampled or undersampled types [3-6]. Examination of phage 0305φ8-36 by electron microscopy revealed an unusually long contractile tail and three large corkscrew-shaped fibers emanating from the upper aspect of the baseplate [4]. The genes of 0305φ8-36 have only distant homologues, and the gene for the large terminase subunit was reported to be anciently derived [4]. Among the functionally annotated gene products [1, 2] are a putative RNA polymerase, DNA polymerase III and associated replicative and metabolic enzymes, two DNA primases, and virion proteins. A thorough survey by mass spectrometry identified 55 virion protein-encoding genes, more than in the prototypical myovirus T4; the excess is greater still when tabulated by the total length, and hence complexity, of virion protein sequence.
The closest homologues of most of the virion protein-encoding genes and a few replicative genes were found to reside in a single segment of the chromosome of B. thuringiensis serovar israelensis. A smaller segment also appears in the chromosome of a closely related species, B. weihenstephanensis. These two phage-like regions are termed BtI1 and BwK1, respectively [1]. In this report, a detailed study is made of the genomic organization and vertical descent of phage 0305φ8-36 in comparison with BtI1/BwK1.
A central problem in comparative genomic analysis is to reconcile the high incidence of horizontal exchanges [7-10] with the observation of conserved gene organization [11]. Some elements of gene order among the genes encoding virion proteins appear to have been conserved in many widely different types of tailed phages, even though these phages are only anciently related [12]. The most commonly observed organization of phage genes includes 1) a conserved order of genes within a head structure and morphogenesis module, and 2) a conserved order of modules for head, tail, baseplate, and tail fiber proteins [11]. This most frequent organization is not found in all phages. In particular, T4 encodes its virion proteins in several genomic segments interspersed with non-virion genes, although functional clustering persists within the segments [13]. The implications of gene order for annotating other large myoviral genomes have been discussed [14]. Phage 0305φ8-36 conforms to this relatively common gene organization in most respects, but it has novel genes implicated in curly fiber formation placed on both sides of the head structure module [1].
A relatively strong conservation of gene organization implies a relatively light load of horizontal transfers. Phage 0305φ8-36 lacks genes recently transferred from other known phage or bacterial genomes [1]. T4-like phages share this feature, and are therefore a useful model for analyzing 0305φ8-36. The T4 genome organization was found to be substantially conserved over a very long time [15, 16]. This supports the proposition that obligatory lytic phages may be less prone to horizontal transfer and hence less prone to reorganization of their genome plan than are temperate phages [17, 18]. An expectation of a particular gene order can be valuable in hypothesizing functional assignments for genes that have diverged beyond easy recognition. This becomes especially true now that there are more elaborate comparative methods to follow up on such a hypothesis. For example, we have demonstrated a strategy of using gene order in combination with weak Blast scores to propose a distant homology, and then following up with comparison of predicted secondary structures [19].
To evaluate weak Blast matches by position in a systematic way across the 0305φ8-36 genome, this study used a computational method that presents its results through the graphics display program Gbrowse [20]. This allowed definition of insertions and deletions (indels) relating 0305φ8-36 and BtI1/BwK1 down to the domain level, and a visual collation of the results with the distribution of other 0305φ8-36 features. One of the major sources of confusion in achieving a totally automated comparison of genomes was the incidence of paralogues. It was found to be most useful to find the paralogues first as part of the basic Psi-Blast searches for each gene and to represent them within the same graphics display as the chains of 0305φ8-36 versus BtI1/BwK1 Blast matches.
Using these and other comparative techniques, we found an extensive conservation of gene order among the virion protein-encoding genes of 0305φ8-36 and BtI1/BwK1. This was in spite of numerous large and small insertions or deletions interspersed with the conserved matches. The time over which this arrangement persisted was estimated to be 2-2.5 billion years (Byr). Within this conserved framework, several multigene modules encoding virion proteins have apparently been inserted. The content of genes encoding virion proteins in these modules accounts for the greater complexity of virion proteins compared to other myoviruses, such as T4. Finally, an evolutionary scenario for the creation of the overall 0305φ8-36 genome plan is explored in which two ancestral phages are fused and then resolved to a single genome plan that still contains remnants of both replication systems.
Methods
Figure 1 was the output of the graphics display program Gbrowse [20], dynamically linked to a locally maintained annotation database for this phage. Figure 2 was modified from the output of a program, b36chain, written in the course of this study to add comparative genomic data to the Gbrowse display. Two different methods were employed to bring positional information to bear on highly divergent homologues, as follows. Program b36chain conducts the equivalent of a Tblastn search between the genome under analysis and a genome selected on the basis of one or more significant matches from a standard database search. There is an inherent improvement in sensitivity because the E-values used to reject chance matches are recalculated based on the size of the subject genome rather than on the size of the entire nr database. The results are collated by position on both genomes. The E-values of weak matches were then further improved by a factor of 3/219 (window length/genome length) if they fell within a 3 kb window around the same established match on both genomes. Matches thus elevated beyond a threshold of significance were then treated as established matches for evaluation of the next 3 kb interval. The program then produced a gff file that directs Gbrowse to add a track with a chain of glyphs representing matches found in the same orientation and order in both genomes and within 3 kb of the same spacing. The program inserts connector glyphs representing insertions or deletions between the matches and scales these to the amount of DNA gained or lost. In the case of conflicting geometry, multiple chains are drawn representing the alternative alignments of the matches. An additional track is also provided reporting the coordinates of each match on the subject and target genomes before positional filtering (not shown). That track defined how the coordinates of the subject genome must be folded to align with 0305φ8-36 coordinates for use in the second method described below. In this case, the unfiltered track shows that there are no plausible divergent relationships other than the indicated matches found in order.
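The positional rescoring step described above can be illustrated with a minimal Python sketch. This is not the b36chain code: the `Match` class, the function name, and the significance threshold of 1e-3 are all hypothetical, and "within a 3 kb window around an established match on both genomes" is interpreted here as lying within 3 kb of an established match on each genome. Orientation checks are omitted.

```python
from dataclasses import dataclass

WINDOW_KB = 3.0    # window length given in the text
GENOME_KB = 219.0  # genome length used in the text's 3/219 factor
THRESHOLD = 1e-3   # hypothetical cutoff; the source does not state one


@dataclass
class Match:
    query_pos: float    # kb, position on the genome under analysis
    subject_pos: float  # kb, position on the subject genome
    evalue: float


def promote_by_position(matches, threshold=THRESHOLD):
    """Return matches accepted as established, sorted by query position."""
    factor = WINDOW_KB / GENOME_KB
    established = [m for m in matches if m.evalue <= threshold]
    pending = [m for m in matches if m.evalue > threshold]
    changed = True
    while changed:  # repeat until no further weak match can be promoted
        changed = False
        for m in list(pending):
            near = any(
                abs(m.query_pos - e.query_pos) <= WINDOW_KB
                and abs(m.subject_pos - e.subject_pos) <= WINDOW_KB
                for e in established
            )
            # A weak match near an established one has its E-value rescaled;
            # if it now passes, it anchors the next interval in turn.
            if near and m.evalue * factor <= threshold:
                established.append(m)
                pending.remove(m)
                changed = True
    return sorted(established, key=lambda m: m.query_pos)
```

In this sketch a weak match with E = 0.05 that lies 2 kb from a strong match is rescaled to about 7e-4 and accepted, after which it can anchor a further weak match in the adjacent interval; an equally weak match far from any established match is left out.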
The image produced was hand edited to resolve alternative alignments due to repeated sequences, in conjunction with creation of a paralogous domain track. This method, represented by the darkest shade of red in Figure 2, is annotation independent. It is almost fully automated and does not require prior prediction of frames on either genome.
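How a chain of ordered matches and its indel connectors might be written out as gff lines is sketched below, assuming the nine-column, tab-separated GFF layout. The feature types (`match`, `connector`), the attribute names (`subject_range`, `indel_bp`), and the source label are hypothetical and are not taken from b36chain; the sketch also assumes forward-strand matches with increasing coordinates on both genomes.

```python
def chain_to_gff(seqid, chain, source="chain_sketch"):
    """Build GFF lines for a chain of matches.

    chain: list of (q_start, q_end, s_start, s_end) tuples in bp,
    sorted by q_start, all on the forward strand (an assumption).
    """
    lines = []
    for i, (qs, qe, ss, se) in enumerate(chain):
        attrs = f"ID=match{i};subject_range={ss}-{se}"
        lines.append("\t".join(
            [seqid, source, "match", str(qs), str(qe), ".", "+", ".", attrs]))
        if i + 1 < len(chain):
            next_qs, _, next_ss, _ = chain[i + 1]
            q_gap = next_qs - qe - 1  # bp between matches on the query genome
            s_gap = next_ss - se - 1  # bp between matches on the subject genome
            indel = q_gap - s_gap     # > 0: extra DNA in the query genome
            if q_gap > 0:             # skip zero-length connectors
                attrs = f"ID=gap{i};indel_bp={indel}"
                lines.append("\t".join(
                    [seqid, source, "connector", str(qe + 1), str(next_qs - 1),
                     ".", "+", ".", attrs]))
    return lines
```

The `indel_bp` value carries the amount of DNA gained or lost between two consecutive matches, which is the quantity a display would need in order to scale a connector.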
The second method to incorporate positional information made use of the annotation for both genomes. Some improvements in BtI1 start codon positions and some additional unannotated BtI1 genes were also incorporated. Annotated frames in each genome that were aligned but not matched by the annotation independent method were subjected to a BlastP search by the "Blast 2 sequences" service at NCBI. The E value for acceptance was arbitrarily increased to a maximum value that still excluded random matches appearing off diagonal on the dot plot of the output. On-diagonal matches were then included in Figure 2 at the second level of red. This method is called the "annotation-dependent" method in the text, and is not automated at this time.
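Although this step was carried out by hand, the acceptance rule can be expressed as a short sketch: raise the E-value cutoff until the next hit to be admitted would be off diagonal. The function name, the tuple layout, and the diagonal tolerance of 50 residues are hypothetical, and "on diagonal" is approximated here as similar start positions in the two proteins, which is only one possible reading of the dot plot.

```python
def accept_on_diagonal(hits, tol=50):
    """Pick the largest E-value cutoff that admits no off-diagonal hit.

    hits: list of (q_start, s_start, evalue) for one pair of proteins.
    tol:  hypothetical tolerance (residues) for calling a hit on diagonal.
    Returns (cutoff, accepted_hits); cutoff is None if nothing is accepted.
    """
    accepted = []
    cutoff = None
    for q_start, s_start, evalue in sorted(hits, key=lambda h: h[2]):
        if abs(q_start - s_start) > tol:
            break  # first off-diagonal hit: the cutoff cannot rise further
        accepted.append((q_start, s_start, evalue))
        cutoff = evalue
    return cutoff, accepted
```

With hits at E-values 1e-5 and 0.5 on the diagonal and a hit at 3.0 off the diagonal, the sketch returns a cutoff of 0.5 and rejects an on-diagonal hit at 8.0, since admitting it would also admit the off-diagonal hit.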
Secondary structure prediction and HMM modelling with SAM were as described [1]. Charge distribution was calculated at the Statistical Analysis of Protein Sequence web server [45, 46]. Figure 4 was derived from logos created by the SAM makelogo utility after incorporation of prior information by the w0.5 utility [47]. Figure 5 was derived from a sequence logo created at Pfam [48]. Figure 6 was produced at the WebLogo web server [49, 50].
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SCH designed the study, performed informatics analysis with respect to genomic organization, and wrote the paper. JAT performed informatics analysis with respect to functional gene assignment and wrote portions of the paper. PS participated in the design and coordination of the study and helped draft the manuscript.