Background
While the role of HIV-1 RNA in the emergence of resistance to highly active antiretroviral therapy (HAART) has been widely documented, less is known about the role of historical viruses stored in the proviral DNA. Following treatment interruption, latent viruses stored in the provirus have been shown to be responsible for the rapid rebound of viral load following treatment interruption [
1‐
3]. However, Palmisano and colleagues suggested that the mutational archive stored in proviral DNA remains unchanged during HAART [
4] with higher levels of DRMs observed in RNA extracted from circulating virus than those present in the provirus [
5]. The suggested reason for these discrepancies is that standard bulk sequencing cannot fully access the spectrum of viral variants stored in the proviral DNA [
5] where resistant viruses may be present in low abundance.
Recent research on HIV-1 in Karonga District, Malawi has shown that of the 40 HIV-1 subtype C infected individuals on HAART that we tested, 14% contained drug resistance mutations (DRMs) in the reverse transcriptase (RT) gene of viruses stored in the provirus [
6] with clonal sequencing of latent viruses showing discrepancies in the presence and prevalence of DRMs in the proviral DNA [
7]. These discrepancies further support the limitations of bulk sequencing for determining drug resistance in viruses contained within the proviral DNA. Traditionally, methods such as single genome amplification and real-time PCR have been used to identify the presence of low abundance viral variants [
8,
9]. We, and others, have shown the importance of clonal sequencing in revealing the spectrum of viral variants present within an individual [
6]. However cloning and other traditionally used methods are labour intensive, expensive, time consuming and/or restricted to the detection of single variants. The advent of next generation sequencing platforms means that clinically relevant low abundance drug resistant HIV variants can be detected to prevalences as low as 1% of the viral population [
10‐
16]. While several studies have found correlations between the presence of low abundance drug resistant viruses and clinical outcomes this has not always been the case [
17,
18]. To our knowledge, no studies have been undertaken to use ultra deep pyrosequencing (UDPS) to quantify the prevalence of low abundance drug resistant viral variants in the proviral DNA. Here, we undertake such a study and endeavor to correlate the presence/absence of such viruses with treatment outcome.
Methods
Sample selection
Samples were collected from five patients recruited to an antiretroviral cohort study between 2007 and 2009 from an ART clinic in northern Malawi. Antiretrovirals have been available in the District since 2005 and in this clinic since 2006. All patients have been started on a fixed dose combination of stavudine, lamivudine and nevirapine on the basis of clinical staging or CD4 count. Previous work identified discrepancies or ambiguities in the presence/absence of drug resistant mutations in viral sequences from samples collected from these five individuals at sequential time-points during treatment [
6]. Fifteen samples representing an average of three sampling time-points from five of these patients were selected for subsequent analysis using ultra-deep pyrosequencing in this current study (Table
1).
Table 1
Samples subjected to ultra-deep pyrosequencing in this study
Patient 2
|
Baseline
| | | 9 | 9 |
| 8 Months | |
K103KN
| | |
|
11 Months
| |
Y181C
| 10 | 20 |
|
14 Months
| | | 3 | 20 |
Patient 32
|
Baseline
| | V90I | 6 | 9 |
|
7 Months
| T215ST | | 2 | 20 |
| 8 Months |
M184I
| | | |
| 14 Months | | | | |
| 15 Months | |
G190AE
| | |
|
16 Months
| | V108AV | 3 | 20 |
|
24 Months
| | | 2 | 20 |
Patient 42
|
Baseline 1 (2008)
| | | 2 | 7 |
|
Baseline 2 (2009)
| V118IV | | 5 | 20 |
| 5 Months | V118I | | | |
Patient 45
|
Baseline
| | V106I, E138A, G190A
| 9 | 20 |
|
8 Months
| | V106I, E138A, G190A
| 10 | 20 |
|
13 Months
| | Sequence Failed | 10 | 20 |
Patient 76
|
Baseline
| | V90I, Y181CY, H221HY | 10 | 20 |
| 3 Months | | | | |
|
6 Months
| |
Y181C
| 10 | 15 |
|
9 Months
| | Sequence Failed | 10 | 19 |
Ultra deep pyrosequencing (UDPS)
DNA extraction and nested PCR was performed as previously described [
6] with the second round PCR primers modified to include the A and B adaptors necessary for 454 sequencing. Each second round primer also included a unique sequence tag (MID) to enable us to distinguish between samples from the multiplexed sequencing reaction. The region targeted for amplification was 760 nucleotides in length covering amino acid positions 13 through 257 of reverse transcriptase (HXB2 numbering). Temperature gradient PCRs identified 57.5°C as the optimum annealing temperature for all secondary PCR reactions with the extended primer sequences for UDPS. Considerable effort was invested into optimizing the PCR amplification approach to account for potential over-amplification of a subset of viral variants present in the quasispecies. For each sample the final PCR strategy involved performing ten first round 100 μl PCR reactions. 5 μl of the product of each primary PCR was used as starting material for two separate secondary PCR reactions, generating a maximum of 20 secondary amplification reactions from each sample. PCR products were quantified on 1% agarose gels in comparison with the Hyperladder 1 molecular marker (Bioline). The 20 PCR products generated from each DNA sample were then mixed in equimolar amounts and an aliquot of the mixed products arising from each DNA sample were then electrophoresed side by side for comparison and further confirmation of quality and quantity. PCR products from all samples were then further pooled in equimolar amounts. Amplicon sequencing was performed on one half of a picotiter plate using a 454 Genome Sequencer FLX by LGC Genomics (Germany).
Data cleaning and analysis
Sequence reads from each individual sample were separated based on their MIDs and subsequent quality control and analysis was performed independently for the reads corresponding to each sample. Next, the data were trimmed using the modified-Mott algorithm implemented in Geneious 5.4.3 [
19] with a 0.01 error probability limit. Initially the identity of the resulting reads to their respective consensus sequence was determined using the BLAST-like word (k-mer) matching approach implemented in Segminator 1.3.2 [
20] with a word size of five and a read quality of two. Reads with significant identity were aligned to the consensus sequence excluding any reads less than 22 nucleotides in length, the dataset was translated to amino acids and the frequency of amino acids at each position was determined using Segminator 1.3.2 [
20].
The prevalence of drug resistant mutations (DRMs) was determined for each sample with three categories of DRMS identified. Firstly, DRMs detected in less than 1% of the reads sequenced from the amplified viral population for that position were discounted to account for potential errors due to the error rate of PCR and UDPS. The second category of mutations consisted of those with a prevalence greater than 1% and less than 20%, corresponding to those that cannot be determined using consensus sequencing. The final category consisted of those DRMs with prevalence greater than 20%, comprising mutations that can, in theory, be observed using traditional consensus sequence genotyping. Given that the number of reads that were sequenced at each timepoint in a patient were unlikely to be exactly the same, we implemented a statistical approach for prevalence calculations as opposed to using a rigid fixed cutoff. For every drug resistant mutation, the 95% confidence interval from the binomial sampling distribution was calculated for the observed frequency taking into account the number of reads sequenced at that timepoint. The upper 95% confidence interval was used as a cutoff for the allocation of DRMs into both the greater than 1% and less than 20% category and the greater than 20% category.
Recent work has described HIV drug resistance mutations that are located within, or adjacent to, homopolymers in the HIV Pol gene [
21]. Given that one of the known shortcomings of the 454 pyrosequencing approach is that it can cause false insertions or deletions in homopolymeric tracts, we undertook an approach to verify if drug resistance mutations at such homopolymers were genuine DRMs or arose as the result of sequencing error. We have developed a novel tool called rapid amplicon mapping in codon space (RAMICS) that uses hidden markov models to map next generation sequencing reads to a reference sequence in codon space, thus generating a biologically relevant alignment (Wright et al., in prep). This approach identifies the presence of hompolymer runs in data and considers them in the mapping process, thus accounting for the possible variances in length as a result of sequencing error. Using RAMICS we confirmed whether resistance calls at DRMs located at homopolymer tracts in the sequence data for all of the datasets were biologically relevant or had arisen as a result of sequencing error.
Permission for the study was received from the National Health Sciences Research Committee, Malawi, and the Ethics Committee of the London School of Hygiene and Tropical Medicine, UK. Written informed consent for participation in the study was obtained from all participants.
Discussion
We have used ultra deep pyrosequencing (UDPS) to investigate the presence of low abundance drug resistance mutations present within the proviral DNA of individuals on ART. Previous studies have suggested that the mutational archive stored in proviral DNA remains unchanged during HAART [
4] with higher levels of DRMs observed in RNA extracted from circulating virus than those present in the provirus [
5]. The suggested reasons for these observations are that standard bulk sequencing cannot fully access the spectrum of viral variants stored in the proviral DNA [
5] coupled with the fact that resistant variants may be present at low prevalence within the proviral DNA. The sensitivity of UDPS to detect drug resistant minor variants is limited by the number of virus templates that can be successfully extracted and amplified using PCR [
12,
22]. Through our thorough PCR approach coupled with the use of UDPS as opposed to consensus sequencing we sought to maximise the genetic heterogeneity that was amplified and subsequently sequenced from the proviral DNA. As might be expected, there was an association between PCR amplification success and sequencing success. For patient 32 month 16, patient 45 month 13, and the 2
nd baseline sample from patient 42, PCR amplification was weaker and thus a higher volume of PCR amplicon had to be added to reach the same final quantity for sequencing as for other samples. For these three samples far more sequences resulting from pyrosequencing had low identity with the consensus sequence and were excluded. It was still worth carrying out pyrosequencing on these samples, however, as thousands of sequences with high identity were returned for analysis. The differences in numbers of quality reads between samples would prevent comparisons of proportions of DRMS across these samples, but not the identification of minor variants.
Bulk sequencing approaches can only detect viral variants present in greater than 20% of the viral population [
23‐
25] and we did see strong correlation between DRMs observed in bulk sequencing and those at greater than 20% of the sequenced viral population in the UDPS at the same time-point. There were, however, further DRMs observed at greater than 20% prevalence that had not been identified using bulk sequencing and the observation of these are likely the result of our approach successfully accessing a greater level of the viral diversity present in the proviral DNA. We would have expected UDPS to detect all of the mutations that were detected via bulk sequencing but this was not the case. From each of five patient samples one DRM, retrieved using the bulk sequencing approach, was absent from the reads retrieved from the samples through the UDPS approach. The phylogenetic origin of each sequence from the bulk sequencing approach was confirmed as belonging to the individual in question, as was the presence of the DRM in the sequence chromatograph indicating that the reporting of the original mutation was not an error, nor was there a mix up of samples. For two samples (patient 32 sample 16 month and baseline 2 samples from patient 42) poor PCR amplification success may be responsible for some mutations being absent. However, for the other three samples this is not the case. Despite utilizing multiple primary and secondary PCRs to maximize the diversity sampled, PCR bottleneck still cannot be excluded as a cause of the absent DRMs. However, as PCR amplifications were from proviral DNA it is perhaps not surprising that some differences would be found between amplicons deriving from different aliquots of DNA, as latent viruses may be more diverse than circulating virus. It should be noted that the presentation of prevalence in our work, as with all other similar studies, should be interpreted as the prevalence observed in the amplified and sequenced viral population as opposed to being a direct measure of the prevalence of a variant in an individual’s viral population. The discrepancy between these two interpretations has yet to be elucidated however the exciting development of degenerative primer ID approaches [
26] will enable such quantification of these potential biases in the near future and help further resolve the PCR bottle neck issue.
Resistant viruses that make up as little as 1% of the viral population within an individual have been suggested to be clinically important as they can expand rapidly under the selective pressure exerted by exposure to HAART [
10‐
16]. The ability of UDPS to effectively quantify such variants is limited by PCR and sequencing induced errors. Hedskog and colleagues eloquently showed that the error rate of UDPS is not uniform across sites within the pol gene and that PCR-induced recombination is minimal [
27]. Learning from reported issues in other studies we used larger sample volumes [
28‐
30], a high depth of coverage [
12,
28‐
30] and a somewhat conservative cutoff of 1% for identifying low prevalence variants. In all patients we identified low abundance variants with DRMs against the current 1
st line therapy. In particular K65R, strongly associated with the development of virologic failure in subtype C infected individuals [
31‐
37], was observed at low abundance in all patients. While we cannot rule out that the observation of low abundance K65R at various time-points in all patients is not as a result of the previously reported propensity for PCR error at this position in subtype C viruses [
38] we did use a high fidelity PCR enzyme and multiple primary PCRs to try to avoid such PCR error.
Despite the observation of low abundance drug resistant variants in all individuals, these viruses appear to have had minimal effect on measured treatment outcome. Due to the geographically isolated setting resulting in logistical difficulties viral loads are not routinely assessed and treatment success, to date, has been measured using WHO stages and CD4 counts. Thus, at the final sampling timepoint one (patient 2) is defined as having immune failure (CD4 cell count <200 cells/mm
3 after at least 12 months on ART) with all remaining patients showing satisfactory treatment response. While the CD4 counts of two of these individuals, patients 45 and 76, were sufficiently high to avoid being interpreted as treatment failures, their semi-borderline CD4 counts (258 and 268 CD4 T cells cells/mm
3) coupled with the presence of multiple minor variant DRMs in their provirus could suggest that these patients may be at high risk for imminent failure. Nevertheless, the CD4 counts of all patients rose over the course of the study despite the detection of low abundance variants at early timepoints and all of the participants remained on 1
st line therapy until their final sampling timepoint up to as much as 24 months after treatment initiation. Thus, it appears that the presence of low abundance DRMs in the provirus of these individuals has little effect on treatment outcome in the short-term. Previous work suggests that HIV rebounds from latently infected cells rather than as a result of continuing low-level replication [
1] and, thus, there is always a possibility that these low abundance resistant variants could emerge to dominate from the proviral DNA following treatment interruption or as a result of poor adherence.
Acknowledgements
This work was supported by Science Foundation Ireland under Grant No. 07/RFP/EEEOBF424 to Grace McCormack and a grant to Simon Travers from the South African Department of Science and Technology. The Karonga Prevention Study is primarily funded by the Wellcome Trust, Grant No. 079828/Z/06, with contributions from LEPRA. RKS is supported by a PhD student bursary from Atlantic Philanthropies. Permission for the study was received from the National Health Sciences Research Committee, Malawi, and the Ethics Committee of the London School of Hygiene and Tropical Medicine, UK. The authors thank Miguel Lacerda, Natasha Wood and Gordon Harkins for insightful discussions regarding the statistical analysis performed in this manuscript.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
VB performed the amplification and sequencing of samples and wrote the first draft of the paper. AC, BN, NF and JRG were responsible for the management and running of the Karonga Prevention Study ART project, the collection of samples and all clinical aspects of the project. RKS undertook the analysis examining the effect of sequencing error at homopolymer regions on the calling of drug resistance mutations. GPM supervised the wet-lab work and the interpretation of results while SAT undertook the preliminary bioinformatics analysis of the sequence data and authored subsequent drafts of the paper. All authors contributed to the final draft.