Introduction

Positive-sense single-stranded RNA (+ssRNA) viruses include highly virulent human pathogens. Within the Flaviviridae family, the genus Flavivirus comprises more than 70 +ssRNA viruses [1], including dengue virus (DENV). Dengue, an acute viral disease transmitted by mosquitoes, is one of the most widespread vector-borne viral diseases in humans. Dengue is caused by any of four antigenically distinct serotypes: dengue-1 (DENV1), dengue-2 (DENV2), dengue-3 (DENV3), and dengue-4 (DENV4). There are an estimated 50–100 million cases of dengue fever annually worldwide, half a million of which result in the severe forms of the disease, dengue hemorrhagic fever and dengue shock syndrome [2]. Generally, infection with one serotype confers future protective immunity against that particular serotype, but not against the others. In fact, dengue hemorrhagic fever may occur from sequential infection by different virus serotypes in a process called antibody-mediated disease enhancement, where antibodies raised against the first serotype enhance infection with the second serotype [3].

The major envelope glycoprotein (referred to as E protein hereafter) of DENV and of other flaviviruses is responsible for important phenotypic and immunogenic properties of the virion and is believed to mediate virus entry into cells [4, 5]. The E protein mediates virus assembly and virus–cell membrane fusion, and initiates infection through binding to cell surfaces. This protein is the principal component of the external surface of the DENV virion and represents the dominant virus antigen, evoking protective immune responses. Dengue serotypes can be distinguished by virus-neutralizing antibodies, but non-neutralizing antibodies against the E protein are cross-reactive. These non-neutralizing antibodies may help bring the virion into close proximity to the normal virus receptor, thus enhancing virus binding and increasing the number of infected cells, with concomitant exacerbation of the disease [6].

While it is well established that the E protein is one of the major proteins responsible for the pathogenicity and immunogenic properties of flaviviruses, the exact residues/regions responsible for these traits remain to be identified. Single-residue substitutions mapped to different parts of the E protein were reported to cause flavivirus attenuation [7], implying that several residues within the E protein are responsible for phenotypic and pathogenic properties.

The crystal structure of the soluble ectodomain of DENV2 E protein reveals a hydrophobic pocket lined by residues that influence the pH threshold for membrane fusion [8–10]. The protein has three structural domains (DI, DII, DIII) that map closely to the three antigenic regions (C, A, and B, respectively) [4]. DENV enters the host cell when the E protein binds to an as yet undefined cell receptor and responds to the reduced pH of the endosome by a conformational change [11]. This conformational change induces fusion of the virus and the host cell membrane. The crystal structure of DENV3 E protein indicates that the serotype-specific mutations that allow viral evasion from immune surveillance (neutralization escape mutations) are all located on the surface of domain III, which has been implicated in receptor binding [12, 13]. The apparent involvement of the host immune system in disease pathogenesis, the so-called antibody-dependent enhancement (ADE), has hampered development of a vaccine against dengue.

The goal of this study is to identify specific regions of the E protein with high potential success as targets for future development of robust diagnostics and vaccines against dengue virus. To achieve this goal we have employed several complementary bioinformatics methods to analyze sequence conservation, selection pressure, immunogenic properties and structural features of the DENV E proteins.

Materials and methods

Residue and site numbers mentioned in the manuscript are based on the DENV2 E protein sequence unless otherwise noted.

Analysis of sequence conservation in the flavivirus genus and dengue species

Sequence conservation of E proteins was analyzed based on structure-guided multiple sequence alignment, neighbor-joining phylogenetic tree, and Shannon entropy. Twelve representative sequences were chosen from the major flavivirus groups where the genome polyproteins are available (Table 1). The taxonomic groups (http://www.ncbi.nlm.nih.gov/ICTVdb/Ictv/fs_flavi.htm) were defined based on the ICTVdB nomenclature of the International Committee on Taxonomy of Viruses [1]. The E protein sequences were extracted from polyproteins in the UniProt Knowledgebase (UniProtKB) [14]. The multiple sequence alignment was manually edited using the Cn3D sequence-structure viewer [15], guided by a reference structure alignment in MMDB [6] using three E protein structures (1OKE, 1UZG, 1SVB), which were originally deposited in PDB [16]. The manual editing involves superimposing and aligning the structures in the structure viewer and then adding individual sequences manually and aligning them with the sequences from the structure. For the C-terminal regions outside the structural alignment (∼100 aa), the sequences were aligned using ClustalW [17] and then manually verified. A phylogenetic tree was generated using the neighbor-joining program in ClustalW with 1,000 bootstrap replicates. The bootstrap values indicate the confidence in the estimated tree branches.

Table 1 E proteins and their source viruses and taxonomic groups in the genus Flavivirus

The sequence conservation of dengue E proteins was further analyzed based on the multiple sequence alignment of 740 full-length proteins from all four serotypes. The conservation in each amino acid position was quantified based on the Shannon entropy, as estimated by the nine-component Dirichlet mixture algorithm [18]. The entropy calculation took into account conservative substitutions of amino acids with similar physicochemical properties.
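The per-position entropy calculation can be illustrated with a minimal sketch. It assumes plain residue frequencies and ignores gaps; it does not implement the nine-component Dirichlet mixture estimate or the treatment of conservative substitutions used in this study, so its values will not match those reported here. The function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0
    total = len(residues)
    counts = Counter(residues)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def alignment_entropy(sequences):
    """Per-position entropy for a list of equal-length aligned sequences."""
    if len({len(seq) for seq in sequences}) > 1:
        raise ValueError("aligned sequences must have equal length")
    return [column_entropy(col) for col in zip(*sequences)]


# Toy alignment (hypothetical, not taken from the dengue dataset).
toy_alignment = ["MRCIG", "MRCVG", "MKCIG", "MRCLG"]
entropies = alignment_entropy(toy_alignment)
```

An invariant column scores 0 bits under this simplified estimate, and the score rises as residues become more evenly distributed across types.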

Evolutionary selection analysis of dengue serotypes

Selection was analyzed using maximum likelihood methods to calculate the non-synonymous/synonymous (dN/dS) nucleotide substitution ratio based on a codon alignment and phylogenetic tree. Maximum likelihood methods evaluate the probability that the chosen model has produced the observed data. DNA sequences were extracted from the NCBI nucleotide nt database [19], resulting in four datasets consisting of 146, 269, 121, and 204 sequences for DENV1, DENV2, DENV3, and DENV4, respectively, as well as a fifth dataset totaling 740 sequences with all four serotypes combined. The codon alignment and consensus neighbor-joining tree were derived using MEGA [20]. To quantify selective pressure at a given codon site, estimates of synonymous (dS) and non-synonymous (dN) substitution rates were compared using a statistical test as described earlier [21]. If dS > dN (or dS < dN), then a site was inferred to be negatively (or positively) selected. We employed two likelihood-based methods, Single Likelihood Ancestor Counting (SLAC) and Fixed Effects Likelihood (FEL) [22], and considered sites to be “selected” when both SLAC and FEL methods yielded a P-value of <0.01. SLAC reconstructs the most likely unobserved ancestral sequences, counts the number of non-synonymous and synonymous changes at every site, and tests whether the number of non-synonymous changes per non-synonymous site is significantly different from the number of synonymous changes per synonymous site. FEL derives the branch lengths and substitution rate bias (global) parameters from the entire alignment, and then directly estimates the ratio of non-synonymous to synonymous rates under a codon-substitution model for each site in a sequence alignment, holding all global parameters fixed. The FEL method is in general more powerful but also more computationally demanding than the SLAC method [22]. For this study we reported only those sites that were concordantly classified by both methods. For the large combined set with all four serotypes, a more stringent cutoff of P < 10⁻⁵ was used [22]. Selection analyses were performed using the HyPhy program [23], which took into account nucleotide substitution biases and dS and dN rate variation across sites. To exclude the possibility of erroneous selection inference due to recombination between different DENV serotypes, a maximum likelihood test for phylogenetic incongruence [24] was run to screen for possible recombination in the E protein; no statistically significant recombination breakpoints were identified in any of the DENV datasets.
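The concordance rule described above can be sketched as follows. The sketch assumes that per-site dN, dS, and P-values have already been exported from the SLAC and FEL analyses; the `SiteEstimate` container and the `classify_site` function are hypothetical names for illustration and are not part of HyPhy.

```python
from dataclasses import dataclass


@dataclass
class SiteEstimate:
    """Per-site output of one method (hypothetical container)."""
    dn: float       # non-synonymous rate estimate
    ds: float       # synonymous rate estimate
    p_value: float  # significance of the dN vs. dS difference


def call_site(estimate, alpha):
    """Classify a site from a single method's estimate."""
    if estimate.p_value >= alpha:
        return "not significant"
    if estimate.ds > estimate.dn:
        return "negative"
    if estimate.dn > estimate.ds:
        return "positive"
    return "not significant"


def classify_site(slac, fel, alpha=0.01):
    """Report a site as selected only when both methods agree."""
    slac_call = call_site(slac, alpha)
    fel_call = call_site(fel, alpha)
    return slac_call if slac_call == fel_call else "not significant"


# Toy values (hypothetical, not results from this study).
agree = classify_site(SiteEstimate(0.1, 1.2, 0.001), SiteEstimate(0.2, 1.0, 0.004))
disagree = classify_site(SiteEstimate(0.1, 1.2, 0.001), SiteEstimate(0.2, 1.0, 0.03))
```

For the combined four-serotype dataset, the stricter cutoff would be passed as `alpha=1e-5`.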

Prediction of T-cell epitopes

Three online prediction programs, NetMHC 2.1 (http://www.cbs.dtu.dk/services/NetMHC/) [25], MHCPred 2.0 (http://www.jenner.ac.uk/MHCPred/) [26] and RANKPEP (http://www.mifoundation.org/Tools/rankpep.html) [27], were used to analyze the four serotypes of dengue E proteins. As both MHC-I and -II molecules are highly polymorphic and the specificity of the alleles is often very different, predictions were performed on multiple supertypes to cover the polymorphic loci. NetMHC was used for the prediction for 12 supertypes of MHC-I locus. A binding affinity threshold of 500 nM was used as the cutoff for MHC-I binding [26]. Both MHCPred 2.0 and RANKPEP were used for MHC-II binding predictions to obtain consensus MHC-II epitopes. For MHCPred prediction, 3 supertypes of MHC-II locus were used, with a final binding affinity cutoff of 60 nM for MHC-II binding to obtain approximately the top 5% of the binders that correspond to regions of synthetic peptides reported to induce T-cell immune responses in mice [28]. For RANKPEP prediction, 50 MHC-II locus types were used and the cutoff was set at 4% of top-scoring peptides with above default threshold scores as reported in Ref. [27]. Data from individual supertypes of MHC-I or -II loci were combined for the analysis of each peptide. To ensure better predictive accuracy, only consensus results above the cutoff of both MHCPred and RANKPEP were considered as MHC-II binders.
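The consensus step for MHC-II binders can be sketched as a set intersection. The sketch assumes that predictions from the two programs have been collected into dictionaries keyed by peptide; it simplifies the RANKPEP criterion to a top-scoring fraction and omits the default threshold score. All names and toy values are hypothetical.

```python
import math


def mhcpred_binders(affinity_nm, cutoff_nm=60.0):
    """Peptides with predicted affinity at or below the cutoff (lower nM binds tighter)."""
    return {pep for pep, value in affinity_nm.items() if value <= cutoff_nm}


def rankpep_binders(scores, top_fraction=0.04):
    """Top-scoring fraction of peptides (a higher score is a better binder)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, math.ceil(len(ranked) * top_fraction))
    return set(ranked[:keep])


def consensus_binders(affinity_nm, scores, cutoff_nm=60.0, top_fraction=0.04):
    """Peptides passing the cutoffs of both predictors."""
    return mhcpred_binders(affinity_nm, cutoff_nm) & rankpep_binders(scores, top_fraction)


# Toy predictions (hypothetical peptide labels and values).
toy_affinity = {"pep1": 12.0, "pep2": 55.0, "pep3": 300.0, "pep4": 40.0}
toy_scores = {"pep1": 30.0, "pep2": 5.0, "pep3": 28.0, "pep4": 9.0}
# A large fraction is used only because the toy set has four peptides.
toy_consensus = consensus_binders(toy_affinity, toy_scores, top_fraction=0.5)
```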

Analysis of protein structural features

The structural features of dengue E proteins were analyzed using known structures in PDB: for DENV2, 1OAN (dimer), 1OK8 (post-fusion trimer), and 1OKE (protein in complex with n-octyl-β-d-glucoside) [8, 29]; and for DENV3, 1UZG (dimer) [12]. The extent of exposure of amino acid residues was determined by computing the relative accessible surface area (ASA) using the POLYVIEW server (http://polyview.cchmc.org/polyview_doc.html) [30]. The relative ASA of a residue is the ratio of the ASA of that residue in the protein to the ASA of the same residue in the fully extended tripeptide alanine-residue-alanine. Based on the value of the relative ASA, the residues were grouped as buried (0.0–0.60) or exposed (0.61–1.0). To determine interactions across the dimer and trimer interface, the occluded surface (OS) area was computed using the method of Pattabiraman et al. [31]. Residues with OS area >0.5 Å² were considered as interacting. The conformational rearrangements occurring during the dimer to trimer transition were measured by the changes in the backbone torsion angles ϕ and ψ between the DENV2 E protein dimer (1OAN) and trimer (1OK8). The conformational angles were obtained using the DSSP program [32], and the difference in the backbone torsion angles, Δϕ and Δψ, for each amino acid residue was calculated.
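The exposure grouping and the torsion-angle differences can be sketched as below. The sketch assumes that ASA values and ϕ/ψ angles have already been extracted from the POLYVIEW and DSSP output. Wrapping the angular difference into the −180° to 180° range is an assumption made here for illustration; the text above does not state how the differences were wrapped. Function names are hypothetical.

```python
def relative_asa(residue_asa, reference_asa):
    """Ratio of a residue's ASA in the protein to its ASA in the extended Ala-X-Ala tripeptide."""
    return residue_asa / reference_asa


def exposure_class(rel_asa, cutoff=0.6):
    """Group a residue as buried or exposed from its relative ASA."""
    return "exposed" if rel_asa > cutoff else "buried"


def torsion_delta(angle_dimer, angle_trimer):
    """Signed change in a torsion angle (degrees), wrapped into [-180, 180)."""
    return (angle_trimer - angle_dimer + 180.0) % 360.0 - 180.0


# Toy values (hypothetical, not taken from 1OAN or 1OK8).
toy_class = exposure_class(relative_asa(90.0, 120.0))  # relative ASA 0.75
toy_delta = torsion_delta(170.0, -170.0)               # rotation across the +/-180 boundary
```

Without the wrapping step, a rotation from 170° to −170° would appear as a 340° change rather than a 20° one.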

Results

Sequence conservation of E proteins in the Flavivirus genus

Sequence conservation of E proteins among dengue species and other members of the genus Flavivirus was studied based on structure-guided multiple sequence alignment using 12 representative sequences chosen from the major flavivirus groups where the genome polyproteins are available (Table 1). The multiple sequence alignment and the corresponding phylogenetic tree are shown in Fig. 1. The neighbor-joining tree reflects the relationships of the taxonomic groups, where the three major groups (1—Tick-borne viruses, 2—Mosquito-borne viruses, and 3—Viruses with no known arthropod vector) and their subgroups are clearly delineated.

Fig. 1

Multiple sequence alignment and phylogenetic tree of 12 E proteins in the Flavivirus genus. The neighbor-joining phylogenetic tree is shown with a scale bar representing branch lengths, numbers above nodes representing bootstrap support (out of 1,000), and the major subgroups are circled. The consensus sequence shown below the alignment indicates residues that are completely conserved (upper-case), highly conserved (lower-case), and in conserved amino acid groups (2: polar, 3: alcohol, 4: charged, 5: aromatic, 6: hydrophobic). Additionally, the background colors for 100, 80, and 60% conserved residues are yellow, blue, and green, respectively. The labels above the alignment indicate the three structure domains (DI, DII, DIII) based on the DENV2 E protein structure (PDB code: 1OAN), the six disulfide bonds (SS1–SS6), the fusion motif, and the receptor-binding motif. The conserved residues involved in critical interactions are boldfaced in red and boxed

The multiple sequence alignment shows 72 completely conserved amino acid residues in the 12 flaviviruses (Fig. 1). These residues include cysteines forming the six pairs of disulfide-bonds [33] that stabilize loop structures in the three structural domains of E protein. Other crucial structure-stabilizing interactions by the flavivirus-conserved amino acids include the D98-K110 salt bridge, several intramolecular hydrogen bonds involving D10, C30, G100, W101, C105, L216, and L218, and the hydrophobic environment provided by leucines (L216, L218, and L264) and V208. Also, completely conserved among flaviviruses are two pairs of residues involved in interactions of domains I and III, R9 and E368, which form a salt bridge, and H144 and H317 that are involved in hydrogen bonds with the main chain of the opposite domain across the interface [34].

The most highly conserved region in flavivirus E protein is 98DRGWGNGCGLFGKG111, where 13 of the 14 residues are strictly conserved (Fig. 1). This peptide, contained in an internal loop between two β-strands on domain DII, corresponds to the known fusion motif [8] involved in DENV infectivity. The highly variable region 380IGVEPGQLKL389, in the lateral loop on domain III, has been implicated in receptor-binding of DENV2 [35] and tick-borne viruses [34].

Sequence conservation of E proteins in the dengue species

The alignment of representative E protein sequences of the four dengue serotypes (Fig. 2A) reveals a very high degree of sequence conservation as expected due to their close evolutionary relationship. A total of 260 residues (∼53% of all residues) are conserved in the multiple alignment of the four sequences. The conservation spans across the entire sequence length, including complete sequence identity of the fusion motif in D98-G111 and the two known N-linked glycosylation sites at N67 and N153 [9] with the Asn-X-Thr/Ser-X potential glycosylation site motif (X can be any residue except for proline).

Fig. 2

(A) Multiple sequence alignment and negatively selected sites of E proteins in four dengue serotypes. The underlined residues on the alignment are negatively selected sites in each serotype and sites under negative selection pressure only in a specific serotype and not in the merged dataset are in red. The residues labeled above the alignment (based on DENV2) indicate negatively selected sites in the merged dataset with all four serotypes; underlined residues are sites that are also under negative selection pressure in at least three serotypes; and boxed residues are sites with entropy score of >0.9 bits. (B) The Shannon entropy quantifying sequence conservation of 740 E proteins from the four dengue serotypes. The entropy score (Y-axis) shown in the line graph ranges from 0.3 to 1.1 bits; the blue lines labeled from I to XII indicate conserved regions. (C) Predicted MHC class II binding peptides for each dengue serotype E protein. The symbols in the scatter plot mark the beginning (amino-terminal) of each 9-mer peptide. The Y-axis is the inverse score of the binding affinity (nM⁻¹) predicted by MHCPred. The residue labels (followed by “+” marks) on the top line of the graph indicate binding peptides within the region common to all four E proteins. The top 5 high-affinity MHC-II binding peptides specific to individual serotypes are indicated by dashed lines

Figure 2B shows the Shannon entropy that quantifies the conservation of each amino acid position in the multiple sequence alignment of 740 E proteins from all four serotypes. The entropy ranges from 0.365 to 2.42 bits, with scores <0.6 for highly conserved residues, 0.6–0.9 for conservative amino acid substitutions, and >0.9 for non-conserved residues. The mean entropy of the full-length protein is 0.539 bits, and 55 and 91% of the positions have entropies of <0.6 and <0.9, respectively.

Comparisons of the entropy measure of the three structural domains DI (amino acids 1–52, 134–191, 280–295), DII (aa 53–133, 192–279) and DIII (aa 296–394), and the C-terminal region (aa 394–495) reveal regions of different sequence variability. In particular, domain DIII has the highest variability, with an entropy mean of 0.703 bits and entropies of <0.6 and <0.9 for 44 and 86% of the positions. Interestingly, serotype-specific neutralization escape mutant sites in DENV E proteins are all located on the surface of domain III [12]. Twelve sequence regions are highly conserved in DENV E proteins, containing five or more consecutive amino acid sites with entropy scores of <0.6, namely, N8-G14, V24-D42, R73-E79, V97-S102, D192-M196, V208-W220, V252-H261, G281-C285, E314-T319, E370-G374, K394-G399, and R411-S424, designated as sequence regions I to XII, respectively (Fig. 2B).
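The criterion for a conserved sequence region (five or more consecutive sites with entropy <0.6 bits) can be sketched as a run-length scan over the per-position entropy scores. The function name and the toy scores are hypothetical.

```python
def conserved_regions(entropies, threshold=0.6, min_length=5):
    """Return (start, end) pairs, 1-based and inclusive, for runs of low-entropy sites."""
    regions = []
    start = None
    for position, score in enumerate(entropies, start=1):
        if score < threshold:
            if start is None:
                start = position          # open a new run
        else:
            if start is not None and position - start >= min_length:
                regions.append((start, position - 1))
            start = None                  # close the current run
    # Close a run that extends to the end of the sequence.
    if start is not None and len(entropies) - start + 1 >= min_length:
        regions.append((start, len(entropies)))
    return regions


# Toy entropy profile (hypothetical): runs of 5, 3, and 6 low-entropy sites.
toy_scores = [0.4] * 5 + [1.0] + [0.5] * 3 + [0.95] + [0.4] * 6
toy_regions = conserved_regions(toy_scores)
```

In the toy profile the run of three sites is discarded because it is shorter than the five-site minimum.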

Amino acid sites under selection pressure in dengue E proteins

Estimates of synonymous or silent (dS) and non-synonymous or amino-acid altering (dN) nucleotide substitution rates at a given position in a codon alignment have become a standard measure of selective pressure, especially in the framework of maximum likelihood phylogenetic methods [22]. If dS is estimated to be significantly greater than dN at a given site (a dN/dS value of <1), this can be taken as evidence of purifying (negative) selective pressure on that site. That is, for that particular site amino acid changes are, on average, deleterious (negative selection). When the opposite is true, i.e. dN > dS (dN/dS > 1), there is selective pressure to generate and possibly maintain amino-acid polymorphisms, i.e. to undergo adaptive change (positive selection). This unusual condition may reflect a change in the function of a gene or an immediate change in environmental conditions (such as a pathogen’s response to an administered drug) that forces the organism to adapt.

To quantify selective pressure, dS and dN were estimated for each site of the alignment of individual dengue serotypes and also for all serotypes combined. The Tamura Nei (TN93) [36] model of nucleotide substitution bias (out of 203 possible models) was selected for all alignments. To each alignment we fitted one of four models of site-to-site rate variation (Constant: dS = dN = 1; Proportional: dN is proportional to dS, which varies among sites; Non-synonymous: dS = 1, dN varies among sites; and Dual: dS and dN vary among sites independently). Strong evidence supporting the Dual model of rate variation was observed in each alignment (shown in supplementary Table S1). This finding suggests that both dS and dN vary across sites, but there is no simple correlation pattern between the two rates [37].

Ninety negatively selected sites were identified in DENV1, 161 in DENV2, 49 in DENV3, and 57 in DENV4, as shown in Fig. 2A (underlined residues on the alignment), while 186 sites were found to be under negative selection pressure in the large combined set with all four serotypes (labeled residues above the alignment). Because the signature of negative selection is the relative abundance of synonymous substitutions likely due to functional constraints, most negatively selected sites corresponded to conserved residues or substitution of an amino acid by another with similar chemical properties (conservative substitutions). Altogether 138 out of 186 (74%) negative sites in the merged set have low entropy scores of <0.6 bits. Furthermore, 14 sites (C3, C60, R73, T189, F213, A267, F306, T319, S376, F392, K394, S424, G445, and V485) are negatively selected in at least three of the four serotypes as well as in the merged dataset (underlined residues); most are highly conserved and a few have conservative substitutions. Only 6 out of 186 (3%) negative sites in the merged set have >0.9 bits entropy scores (S95, P143, T180, S300, V309, and Y488, boxed residues). Note that while these sites are not conserved within the dengue species, they all correspond to negatively selected sites in one or two specific serotypes, possibly reflective of selective sweeps that have become fixed in individual serotypes and are now maintained by purifying selection (negative selection). Finally, several serotype-specific negatively selected sites were also identified (19 in DENV1, 48 in DENV2, 7 in DENV3, 16 in DENV4) (Fig. 2). Most notable are the 5 (E383 and Q386-L389) DENV2-specific negative sites in the receptor-binding region.

Although there are several reports on the role of positively selected sites in pathogen–host interaction, such as evading host immunity [38–42], there are few reports of experimental validation of the functional significance of negatively selected sites. It has been shown that epitopes consisting of negatively selected sites perform better as vaccines than ones containing positively selected sites [43, 44]. The underlying assumption is that because negatively selected sites are less likely to change, due to functional constraints, vaccines or diagnostic targets directed against them may be more effective.

No positively selected sites were detected in any of the dengue serotypes in this study. Our findings agree with previous selection studies where constant dS across all sites was assumed a priori and the data were not stratified based on genotypes and passage types [45, 46]. Comparable analysis of E gene sequences from other flaviviruses, such as St. Louis encephalitis virus, West Nile virus and Yellow fever virus, also did not detect any positive selection [47]. In order to screen for possible selection on amino-acid residues prior to the divergence of serotypes, we estimated dN and dS along the four tree branches separating individual serotype clades in the joint phylogeny using a fixed effects likelihood method [22]. Five sites with evidence of ancient positive selection were suggested (P ≤ 0.001): N83, P132, E174, Q293, and L458.

MHC peptide binding in dengue E proteins

MHC-I peptide binding prediction using NetMHC [25] identified several potential binding regions for each of the four E proteins. However, the overall MHC-I binding affinity was low and the number of high binders was small. Low predicted MHC-I binding was further confirmed using MHCPred (data not shown). These results are consistent with the observation that E proteins mainly induce antibody response to DENV, while non-structural protein 3 (NS3) mainly induces T-cell immune responses to DENV [48].

Many MHC-II binding peptides (Th-cell epitopes) were predicted by both MHCPred and RANKPEP [27] above the affinity threshold (56, 64, 50, and 53 peptides for DENV1, DENV2, DENV3, and DENV4, respectively, had affinities ranging from 1 to 60 nM) (Fig. 2C). Of these, 11 peptides are in regions common to all four serotypes (labeled by beginning residues in Fig. 2C) and represent immunogenic consensus sequence epitopes in the DENV species. The figure also indicates high-affinity MHC-II binding peptides among the top-ranking predictions (dashed lines) that are unique to one of the four serotypes. The predicted binding peptides common to all serotypes generally are present in more conserved regions with low-entropy and/or negatively selected sites, except the last two consensus binding peptides occurring in the C-terminal transmembrane region. On the other hand, most predicted serotype-specific binding peptides are in variable regions with lower sequence conservation. Interestingly, the most highly variable domain DIII contains only predicted serotype-specific binding peptides, but no consensus binding peptides common to all four serotypes.

To cross-validate the predicted results, we further mapped the 64 predicted MHC-II binders of DENV2 E protein to regions covered by synthetic peptides that were previously determined to mimic Th-cell epitopes and to elicit antibody responses in three different mouse strains [28] (amino acid regions of peptides in Supplementary Table S2). As 15 of the predicted peptides are not completely covered within regions of the synthetic peptides, the comparison was based on the remaining 49 predicted peptides. We noted that 39 of the 49 (80% true positive) predicted MHC-II peptides were matched with 16 synthetic peptides that experimentally tested positive for Th-cell epitopes; while the remaining 10 binders (20% false positive) correlated to synthetic peptides that did not elicit an immune response either in vitro or in vivo [28]. Conversely, the computational methods predicted all but one of the 17 synthetic peptides shown to induce an immune response (Table S2), yielding a 94% (16/17) recall rate. The predictive accuracy observed here is consistent with the benchmarking results of epitope prediction programs [49].
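The two validation percentages follow directly from the counts given above and can be checked with a short calculation. Note that the first ratio is taken over predicted peptides and the second over synthetic peptides; the variable names are hypothetical.

```python
# Counts stated in the text for DENV2.
predicted_total = 64          # predicted MHC-II binders
not_fully_covered = 15        # predictions not fully covered by synthetic peptides
matched_positive = 39         # predictions matching positive synthetic peptides
positive_synthetic = 17       # synthetic peptides that induced a response
recovered_synthetic = 16      # positive synthetic peptides recovered by prediction

evaluated = predicted_total - not_fully_covered    # 49 comparable predictions
matched_negative = evaluated - matched_positive    # 10 predictions on non-responding peptides

true_positive_rate = matched_positive / evaluated  # fraction of predictions confirmed
recall = recovered_synthetic / positive_synthetic  # fraction of positive peptides recovered

print(f"confirmed predictions: {true_positive_rate:.1%}")
print(f"recall: {recall:.1%}")
```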

Consistent with the notion that the variable surface residues are likely to be responsible for the serotype-specific immunogenic variation, we identified four specific sequence regions (329DGS331, 342LEKRH346, 360EKDS363, and 383EPG385) that also match with predicted serotype-specific Th-cell epitopes and/or neutralizing mAB-binding regions and experimentally determined Th-cell epitopes [28]. These serotype-specific Th-cell binders, coupled with the predicted consensus Th-cell binders (Fig. 2C), reveal dengue immunogenic properties at both the species and serotype levels.

Structural features of dengue E proteins

Figure 3 shows the results of several structural computational analyses to assign functional roles to amino acid residues in dengue E proteins. Based on the relative accessible surface area (ASA), buried residues that are important to maintain the structural integrity of the protein and exposed residues that may provide clues about protein interaction and immunogenicity were identified. The dimer (pre-fusion) and trimer (post-fusion) structures of DENV2 E protein each have about 60 exposed residues (relative ASA >0.6), half of which remain exposed on both the dimer and the trimer surfaces.

Fig. 3

Structural analysis results of DENV2 E proteins indicating exposed residues, main chain conformational changes and residues in the oligomer interface, in relationship to sequence conservation. The non-conserved residues (entropy >0.9 bits) are shown in the top line. The conformational changes are plotted based on the difference in the backbone torsion angles (Δϕ and Δψ) from −180° to 180°. The region of residues 144–159 has no atomic coordinates and is not plotted. The exposed residues (relative accessible surface area >0.6) and the residues in the oligomer interface are indicated on the 180° line and −180° line, respectively, with the following notation: in DENV2 dimer (1OAN) only (×), in DENV2 trimer (1OK8) only (+), and in both dimer and trimer (*)

The solved structures for DENV2 and DENV3 show that there are minor structural differences at the viral surface of the two serotypes. It has been suggested that the non-conserved residues exposed on the viral protein surface may be involved in differential antibody binding [12]. Among 43 non-conserved residues across the four serotypes (entropy >0.9 bits), 19 are exposed on the surface of either oligomer, including 5 in the dimer only (K157, P243, Q293, E343, and E360), 1 in the trimer only (L342), and 13 in both the dimer and trimer (N83, K88, K122, E174, D203, Q227, S274, S300, D329, R345, H346, D362, and G385). Domain III alone has 9 exposed and non-conserved residues, including 6 exposed in both the dimer and trimer.

A total of 128 interface residues critical for dimerization and trimerization based on the occluded surface area were identified. In particular, it was noted that several residues in the fusion motif, D98, L107, F108, and K110, are involved in interactions at both the dimer and trimer interfaces of DENV2 E proteins (Fig. 3), as well as the dimer interface of DENV3 E protein (not shown). It was further noted that most of these interface residues are highly conserved within the dengue species. Approximately 73% (94 residues) of all interface residues either have entropy scores <0.6 or are negatively selected, and only 4% (5 residues) are non-conserved with entropy scores >0.9. A few (18) interface residues are among the completely conserved residues in all 12 flaviviruses. The interface residues represent potential candidates for mutation experiments that may alter oligomerization. This region may also be a potential target for inhibitors that prevent oligomerization.

There are significant conformational rearrangements in the main chain during the dimer to trimer transition. The plot of the difference in the backbone torsion angles (Δϕ and Δψ) shows several regions with major conformational changes, such as 1–19, 242–246, 289–298, and 343–350 (Fig. 3). Many residues change from buried to exposed during the transition, suggesting the importance of these residues in the fusion mechanism. For example, buried residues M1, H244, K246, G254, G330, and K344 in the dimer become exposed in the trimer after significant conformational changes in the main chain, while residues including S16, Q52, Q167, S169, P243, D290, Q293, S331, and E343 change from exposed to buried during trimerization.

Comparative analysis of computational and experimental data

To estimate the relative accuracy of our analyses, computational data on sequence conservation, negative selection, structural features, and T-cell epitopes (Figs. 3 and 4) were compared with each other and also with published, experimentally determined functional sites. Such integrated analysis allowed identification of sites that are exposed in the dimer and are also negatively selected (e.g. N37, N67, K88, E195, P217, G266, D290, S300, D362, F373, and E383). Other sites were identified, such as 266GAT268 and 445GAAFS449, which (a) belong to the group of three or more consecutive sites under negative selection pressure in DENV2, (b) have a residue that is negatively selected in at least three of the serotypes (as observed in the merged dataset), and (c) are also part of a predicted epitope. Overall, our computational results are in agreement with experimental information (see supplementary Table S3). Of the high-affinity MHC-II binding peptides predicted here, 80% mapped to synthetic peptides shown experimentally to induce T-cell immune responses [28]. The correlation between computationally predicted data and available experimental information suggests that the computational approaches used here relate rather well to biological features.

Fig. 4

Ribbon representation of the DENV2 E protein dimer structure (1OAN) showing domains I (red), II (cyan), III (blue) with 12 DENV conserved sequence regions shown in pink, purple, and grey, respectively in the three domains. The dimer interface is shown as gold spheres for chain A and light gold for chain B. In addition, residues that are conserved, part of the consensus Th-cell epitopes and exposed in the dimer or trimer (N37, Q211, D215, P217, H244, and K246), are shown as spheres and their sequence positions indicated. The residues H244 and K246 are buried in the dimer (this figure) but are exposed in the trimer

Identification of functionally significant sites

Development of diagnostics with low rates of false negatives and of vaccines that are difficult to circumvent (by nature or by man) would benefit from identifying the amino acid sites that should remain unaltered in spite of natural changes or artificial modification of dengue virus. We integrated all the computational results described above in this study and searched for sites in E protein that were (a) conserved, (b) consensus T-cell epitopes, and (c) exposed. A site-by-site analysis of the sequence of E protein revealed, as expected, that different features were distributed throughout the sequence. Rather unexpectedly, however, we observed six sites that had more than one feature in a confined region of E protein. The sites having several features might be of particular importance to the viral genome and, therefore, unlikely to change without profound effects on infectivity and/or virus propagation. We considered these six singular sites (Fig. 4; N37, Q211, D215, P217, H244, K246) as potential candidate regions of the E protein for diagnostics and vaccine development. Two sites (N37, P217) are exposed in both dimer and trimer, two (Q211, D215) are exposed only in the dimer, and two (H244, K246) are exposed only in the trimer. Of these six sites, H244 and K246 undergo conformational change between the dimer and trimer forms.
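The three-way filter described above amounts to a set intersection followed by a grouping on exposure state. The following is a minimal sketch rather than the pipeline actually used: the function names are hypothetical, and the labels `HYPO_A` and `HYPO_B` are invented placeholders added only to show sites being filtered out. The six real site labels and their exposure states are taken from the text.

```python
def select_sites(conserved, epitope, exposed_dimer, exposed_trimer):
    """Sites that are conserved, in a consensus epitope, and exposed in either form."""
    return conserved & epitope & (exposed_dimer | exposed_trimer)


def group_by_exposure(sites, exposed_dimer, exposed_trimer):
    """Group selected sites by the oligomeric state(s) in which they are exposed."""
    return {
        "both": sorted(sites & exposed_dimer & exposed_trimer),
        "dimer_only": sorted((sites & exposed_dimer) - exposed_trimer),
        "trimer_only": sorted((sites & exposed_trimer) - exposed_dimer),
    }


# HYPO_A and HYPO_B are hypothetical placeholders, not real E protein sites.
conserved = {"N37", "Q211", "D215", "P217", "H244", "K246", "HYPO_A"}
epitope = {"N37", "Q211", "D215", "P217", "H244", "K246", "HYPO_B"}
exposed_dimer = {"N37", "P217", "Q211", "D215", "HYPO_A"}
exposed_trimer = {"N37", "P217", "H244", "K246", "HYPO_B"}

selected = select_sites(conserved, epitope, exposed_dimer, exposed_trimer)
groups = group_by_exposure(selected, exposed_dimer, exposed_trimer)
print(sorted(selected))
print(groups)
```

Keeping each feature as its own set makes it straightforward to add further criteria, such as negative selection, as one more intersection.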

Discussion

In this study, the sequence alignment of the flavivirus E proteins (Fig. 1) and the entropy measure and negative selection results for dengue E proteins (Fig. 2A and B) have allowed us to identify sites and regions that are conserved across the flavivirus genus or within the dengue species, as well as variable regions that reflect serotype-specific functional constraints. Such analyses of conserved sites at the different taxonomic levels allow us to differentiate residues of general importance to the infectivity of flaviviruses from residues that may be specific to the dengue viruses. For example, among the 12 dengue-conserved sequence regions (Fig. 4), regions I–IV and VI–VIII are also highly conserved in other flaviviruses, encompassing over half of the 72 completely conserved residues in flavivirus E proteins. These regions also cover 16 of the 18 DENV2 dimer or trimer interface residues that are completely conserved in all 12 flaviviruses. On the other hand, sequence region V (D192-M196) is dengue species-specific. Interestingly, region V overlaps with a 13-amino acid sequence region that contains 11 negatively selected sites identified to be under functional constraints. Such selection pressure analysis of codon-based DNA alignments is ideal for identifying functionally important residues in serotypes that may not be reflected in the amino acid alignments at the species and serotype level [50].

Several dengue-specific sites were identified. For example, while the N153 glycosylation site is conserved in several flaviviruses, the N67 site is unique to DENV. It appears that dengue viruses are heterogeneous in their use of the glycosylation sites [51] and the precise function of the second glycosylation site is still under investigation [52]. The significance of this site is not known, although it has been noted that the loss of the N67 glycosylation site may result in a higher pH threshold for conformational change [53]. We have identified variable residues exposed on the surface of the E protein that are likely to be responsible for the immunogenic variation among dengue serotypes. Especially notable is the sequence region 342LEKRH346 in structural domain III, which consists of five surface-exposed residues (1 in dimer, 2 in trimer, and 2 in both dimer and trimer) at the beginning of a region (aa 343–350) that undergoes major conformational changes during the dimer-to-trimer transition, with E343 becoming buried and K344 becoming exposed. Four (L342, E343, R345, H346) of the five residues are not conserved among the serotypes and have entropy scores >0.9. Another variable and exposed region lies within the 10-aa receptor-binding region (I380-L389), which contains 4 exposed residues, including the 383EPG385 triad critical for mAb binding [12].
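An entropy score of the kind cited above can be computed per alignment column. The sketch below uses Shannon entropy in bits, which is an assumption: the logarithm base, any normalization, and the handling of gaps behind the >0.9 threshold are not restated in this section. The alignment is a made-up toy example, not dengue sequence data, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of the residues in one alignment column."""
    counts = Counter(column)
    total = len(column)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def variable_positions(aligned_seqs, threshold=0.9):
    """1-based positions of columns whose entropy exceeds the threshold."""
    columns = zip(*aligned_seqs)  # transpose rows (sequences) into columns
    return [
        pos
        for pos, column in enumerate(columns, start=1)
        if column_entropy(column) > threshold
    ]


# Toy alignment of four equal-length sequences (hypothetical data).
toy_alignment = ["ACK", "AGK", "ATK", "ASK"]
print(variable_positions(toy_alignment))
```

In this toy case only the middle column varies, so it is the only position reported. Gap characters are treated as ordinary symbols here, which a real analysis would likely handle separately.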

Recent progress in molecular-based vaccine strategies, such as recombinant subunit dengue vaccines, has provided hope for the control of the disease [6]. The phenomenon of antibody-dependent enhancement of dengue disease has spurred attempts to develop a tetravalent dengue vaccine that produces neutralizing antibodies against all four serotypes [54]. Large-scale analysis of antigenic diversity of T-cell epitopes for dengue virus [55] indicates that there are limited numbers of antigenic combinations in E protein sequence variants, and that short regions of the protein are sufficient to capture the antigenic diversity of T-cell epitopes. Taken together, the 11 predicted consensus Th-cell epitopes that we identified, especially the 3 epitopes containing the 6 select target sites, are of special interest as potential candidate regions for inclusion in developing epitope-driven vaccines against dengue viruses. A T-cell epitope-driven vaccine design approach has been used for HIV-1 (e.g. the GAIA vaccine) [56] with promising results [57].

In addition to the six select targets that we propose, there are several additional promising regions that can be identified from Figs. 1 to 3 according to specific experimental needs. Furthermore, researchers can evaluate epitopes and diagnostic targets to see if they fall within regions that are under negative selection pressure or are exposed. For example, several nucleotide- and protein-based methods are currently available for dengue diagnostics [58], and one can choose which targets described in this study are best suited to a specific methodology. The predictive nature of this study allows prioritization of select sites and regions of the DENV E protein for laboratory experimentation, with some caveats. First, the development of effective and safe vaccines should involve careful consideration in the selection of exact sequence segments and the choice of specific expression vector systems, both of which are outside the scope of this study. Second, effective diagnostics need to be validated in the laboratory and in the field.

Conclusions

This study provided a priority list of potential target sites in E protein that can be used by experimental biologists involved in dengue diagnostics and vaccine research. A battery of complementary computational tools was necessary to identify salient sites and regions having several desired features simultaneously. This form of detailed computational analysis, coupled with experimental laboratory research, could be instrumental in accelerating the development of viral diagnostics and vaccines.