Background
Human T cell lymphotropic virus-type 1 (HTLV-1) is estimated to infect over 10 million people[1], and is endemic in sub-Saharan Africa, the southern islands of Japan, the Caribbean and parts of South America. HTLV-1 is primarily found in CD4+ T cells, where predominantly only a single copy of the virus integrates into the genome[2]. The virus is almost 100% cell-associated, and the viral burden is defined as the fraction of PBMCs that carry the integrated provirus, termed the proviral load (PVL). Infected cells proliferate in vivo, producing clonal populations of cells, each defined by its unique proviral integration site. The viral regulatory proteins Tax and HTLV-1 basic leucine zipper factor (HBZ) are known to drive proliferation of the infected cells[3-5].
More than 90% of HTLV-1-infected individuals remain lifelong asymptomatic carriers (AC), but 1–6% develop an aggressive malignancy known as adult T cell leukaemia/lymphoma (ATLL). A further 0.25 to 4% develop a chronic inflammatory disease of the central nervous system, HAM/TSP, characterised by a slowly progressive spastic paraparesis with pain and neurogenic bladder disturbance[6]. Risk factors for HAM/TSP include female gender and high PVL[7].
There is strong evidence that the CD8+ T cell response is important in limiting PVL, and reducing the risk of HAM/TSP[8], although innate immunity also plays a role in the host response to HTLV-1[9]. Certain HLA class I alleles are associated with a reduction in PVL and prevalence of HAM/TSP, in particular HLA-A∗02 and Cw∗08 in a population from Southern Japan[10]. Tax is the dominant CD8+ T cell target antigen of HTLV-1[11,12]: Tax escape mutations in the HLA-A2-restricted epitope Tax 11-19 are more frequent in individuals with the HLA-A2 allele[13], and Tax expression is frequently silenced in the expanded clone in ATLL by mutations in tax or methylation or deletion of the 5’LTR[14-17]. The rate of lysis of Tax+CD4+ cells by CD8+ cells has been inversely correlated with PVL[18], although Tax mRNA is virtually undetectable directly ex vivo. Individuals who remain asymptomatic were shown to have a lower PVL than those with HAM/TSP at a given lysis rate[18], and had a greater CD8+ T-cell lytic efficiency as measured by proportion of Tax-specific CTL which degranulate when exposed to their cognate epitope ex vivo[19].
Unlike Tax, HBZ expression is uniformly maintained in HTLV-1-infected T cells, including ATLL cells[4], and this expression correlates with PVL in both ACs and patients with HAM/TSP[20]. On average, HBZ peptides bind to HLA class I alleles with lower affinity than Tax peptides, and the frequency of HBZ-specific CD8+ T cells[21] is correspondingly lower. HBZ expression may be maintained because it can drive expansion of an infected clone without presenting a strong target to the CD8+ T cell response. The frequency of HLA class I alleles that are predicted to strongly bind HBZ peptides is greater in ACs than patients with HAM/TSP, and is inversely correlated with PVL in each group[21]. These observations suggest that a CD8+ T-cell response to the HBZ protein is protective against HTLV-1-associated inflammatory disease.
The equilibrium abundance in vivo of a particular HTLV-1-infected T-cell clone is the result of the interplay between the proliferation of the clone and counter-selection by the host response, chiefly the CD8+ T cell response. Both factors are governed by the program of proviral expression by the clone. Since the proviral sequence is very stable[22], the chief unique attribute of each HTLV-1-infected T-cell clone is the genomic position of the integrated provirus: the proviral integration site. Specific features of the genomic environment of the HTLV-1 proviral integration site are associated with the frequency of spontaneous reactivation of Tax expression ex vivo[23]. Integration in the same transcriptional orientation as a flanking host gene is associated with suppression of Tax expression: same-sense orientation is more frequent in high-abundance clones, and more frequent in vivo than during in vitro infection, suggesting that this orientation confers a selective advantage by allowing escape from the Tax-specific CD8+ T cell response[23]. There are no published data on the influence of the integration site environment on HBZ expression.
Since the genomic integration site influences proviral expression, we reasoned that the selection pressure exerted by a protective immune response will alter the abundance of clones which have integration site genomic environments with certain characteristics. Our previous reports were neither designed nor powered to examine the relationship between the integration site environment and either disease status or host immunogenetics. In this study, we investigated the differences in integration site environment between Japanese individuals who remained AC and those who developed HAM/TSP, and between those that differed in their capacity to present HBZ peptides on protective HLA class I alleles. We report that integration sites in genes and active regions are significantly more frequent in ACs than in patients with HAM/TSP, even after accounting for clone abundance and PVL. Integration sites in genes are also more frequent in strong HBZ binders.
Discussion
At all levels of clone abundance, ACs had a significantly higher frequency than patients with HAM/TSP of integration sites within host genes and in genomic regions marked by activating epigenetic modifications. This enrichment was associated with disease status per se and was independent of variation in proviral load. The odds ratio of this enrichment in each case was modest (~1.1); however, the finding that both integration in a gene and integration in an active genomic region were independently associated with AC status strongly suggests a consistent underlying biological mechanism. The question arises: what are the forces that favour selective survival of these integration sites (i.e. in actively transcribed regions) in ACs?
We previously reported that Tax-silenced proviruses were more likely to lie in the same orientation as a flanking host gene or nearby upstream transcriptional start site; we concluded that this effect may be attributed to transcriptional interference[23]. The integration site locations that inhibit Tax expression were also associated with increased clone abundance. Consistent with these two findings, more abundant clones were less likely to express Tax[23]. These observations raised the possibility that a provirus integrated in the same orientation as a host gene might enjoy a selective advantage in individuals with an effective immune response because it is less exposed to the strong anti-Tax CD8+ T cell response. However, the integration site environments associated with Tax silencing were not those associated with an increased frequency in ACs compared to HAM/TSP in this study.
A second possibility is that a more efficient cellular and innate immune response in ACs[8] means that a clone needs a greater proliferative capacity to reach a given absolute abundance. Interestingly, we also observe an increase in the percentage of integration sites in a gene amongst integration sites from individuals with strong predicted HBZ binding capacity compared with those less likely to bind HBZ epitopes, particularly in the less abundant clones. Increasing numbers of HLA class I alleles able to present HBZ are associated with decreasing PVL, suggesting that there is greater control of infected cells in these patients[21]. Integration sites located in genes and active genomic regions are associated with increased clone abundance ([24], and this study): we postulate that these environments support virus-driven cell proliferation, allowing clones to survive under stronger immune control.
Tax is known to drive proliferation of the infected cell; could the clones which selectively survive in ACs express more Tax? We previously observed, in a very small sample of integration sites (n = 40[26]), that Tax-expressing cells had a higher frequency of integration sites in genes. However, with the advent of high-throughput sequencing, we have recently shown that clones that spontaneously express Tax ex vivo have a minor increase in the frequency of integration in a gene and in regions with activating epigenetic marks ([23] and AM, unpublished observations). Since we have previously observed that more abundant clones are less likely to express Tax[23], Tax expression is unlikely to completely account for the success of these integration site clones. HBZ is also known to drive cell proliferation[4], and integration in a gene or active region may also promote increased expression of HBZ. To definitively determine, however, whether transcriptionally active regions promote increased HBZ expression will require high-throughput sorting and integration site analysis of HBZ expressing clones directly ex vivo: this is currently precluded by the difficulty in sorting cells based on detection of HBZ protein in naturally-infected PBMCs. The role of the integration site in driving proliferation via either Tax or HBZ expression is also complicated by the effect of Tax and HBZ on the expression or function of each other.
A recent study in primary infection with BLV, a related retrovirus[27], has shown that early in infection, integration is favoured in transcriptionally active areas but is strongly selected against by the host immune response. Yet in subsequent chronic infection, abundant clones have a higher frequency of integration sites in transcriptionally active areas. Similarly, in HTLV-1, the effectiveness of the initial host response against expressed viral proteins is likely to define PVL set point, selecting against highly active clones. Clones with integration sites in heterochromatin may also represent a dead end for the virus, because the provirus may never be re-expressed. During lifelong chronic infection, however, surviving clones with integration sites in ‘intermediate’ transcriptionally active areas may have a proliferative advantage, although other factors (including clone TCR specificity or immune escape by Tax silencing or timing of viral expression[16,23,28]) will also contribute to the relative success of a clone. These transcriptionally active clones are more common in ACs than HAM/TSP, plausibly because they compete against the effective host response in ACs, or less likely, because they are selectively lost (by an unknown mechanism) in HAM/TSP.
There are other differences between HAM/TSP patients and AC individuals, in addition to the effectiveness of the T-cell immune response, which could alter the selection of proviral integration site during chronic infection. For example, in HAM/TSP, proliferation of HTLV-1-infected T cells may be maintained by IL-2[29] and IL-15[30], which may reduce the advantage conferred by integration sites that increase expression of proliferation-inducing factors such as HBZ.
Our results reflect systematic differences in the characteristics of HTLV-1 integration sites that persist in vivo between HAM/TSP patients and ACs. We propose that these differences are not themselves causative of the disparate clinical outcomes, but rather they reflect an underlying difference between patients with HAM/TSP and ACs in the efficiency of host-mediated control of HTLV-1 replication, for which there is extensive evidence[8].
Two previous studies compared the integration site environment between ACs and patients with HAM/TSP and found no differences, or a borderline significant (p = 0.049) difference[24,26]. The difference between these reports and the present study may be attributable to the differences in sample size (N = 238 in the present study, cf. N = 40 and 24 respectively in[24,26]), the quantitative nature and greater sensitivity of the present high-throughput method, or the ethnicity of the study population (uniformly southern Japanese vs. predominantly Caribbean). The incidence of HAM/TSP is much lower in the Japanese population (studied here) than in individuals of Caribbean origin in the previous studies (0.25 vs 3%[31,32]). In the current study, we excluded differences in LM-PCR efficiency and mean clone abundance as causes of the observed differences between patients with HAM/TSP and ACs.
Methods
Subjects
Kagoshima cohort. The study population has been previously reported[7,10]. Subjects consisted of 229 patients with HAM/TSP and 202 HTLV-1-infected asymptomatic carriers randomly selected from blood donors; all were of Japanese ethnic origin and residing in Kagoshima, Kyushu, Japan[7,10].
Kumamoto cohort. The study population consisted of 98 HTLV-1-infected asymptomatic carriers from blood donors in Kumamoto Prefecture, Kyushu, Japan.
Research was carried out in compliance with the Helsinki Declaration. The study was approved by the Faculty of Medical and Pharmaceutical Sciences Ethics Review Board, University Hospital, Kumamoto, Japan (Ethics 149). All patients gave written, informed consent for the study and for publication of anonymized results.
HTLV-1 proviral load measurements
HTLV-1 DNA was amplified by quantitative PCR in an ABI7900HT FAST real time PCR system using FastSYBR Green (Applied Biosystems) reagents with the Tax-specific primers SK43 and SK44. A control region in β-actin was also amplified using ActF and ActR primers. The rat cell line TARL-2, which contains one integrated copy of the HTLV-1 provirus, was used to generate a standard curve. The sample copy number was interpolated from the standard curve and PVL was expressed as number of infected cells per 100 PBMCs. Proviral load data for the Kagoshima HAM/TSP and AC cohorts, with TARL-2 as pX region control, were as previously described ([7]). SK43: 5′CGGATACCCAGTCTACGTGT, SK44: 5′GAGCCGATAACGCGTCCATCG, ActF: 5′TCACCCACACTGTGCCCATCTATGA, ActR: 5′CATCGGAACCGCTACTTGCCGATAG.
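The interpolation from the standard curve can be illustrated with a short sketch. This is not the authors' pipeline: it assumes a log-linear standard curve (Ct against log10 copy number), two β-actin copies per cell and one provirus per infected cell. All function names and numbers are hypothetical.

```python
import math

def fit_standard_curve(copies, cts):
    """Least-squares fit of Ct = slope * log10(copies) + intercept."""
    xs = [math.log10(c) for c in copies]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def interpolate_copies(ct, slope, intercept):
    """Invert the standard curve to get a copy number from a sample Ct."""
    return 10 ** ((ct - intercept) / slope)

def proviral_load(tax_copies, actin_copies):
    """Infected cells per 100 PBMCs.

    Assumption: two actin copies per cell, one provirus per infected cell.
    """
    cells = actin_copies / 2.0
    return 100.0 * tax_copies / cells

# Hypothetical dilution series of a single-copy standard.
slope, intercept = fit_standard_curve(
    [10, 100, 1000, 10000], [34.68, 31.36, 28.04, 24.72]
)
tax = interpolate_copies(31.36, slope, intercept)
```

Under these assumptions, a sample with 50 tax copies and 2000 actin copies corresponds to 1000 cells and a PVL of 5 infected cells per 100 PBMCs.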
HLA class I alleles
HLA class I typing of the Kagoshima cohort was reported previously ([10]). HLA typing of the Kumamoto cohort was done by Luminex reverse SSOP at the Hammersmith Hospital, London, UK to 2 digit resolution with ‘strings’ of possible 4 digit resolution alleles. For each individual’s string of possible alleles, the most frequent 4-digit allele in the Japanese population (represented by a study of 1018 Japanese individuals[33]) was identified as the most likely allele. If there were multiple possible alleles with a population frequency >3%, all these frequent alleles were retained as possibilities for the individual. If no allele subtype in the string was represented in the population study, all alleles in the string were retained.
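The allele-resolution rule described above can be sketched as a small function. This is an illustrative reading of the three stated rules, not the authors' code; the function name, the allele labels and the population frequencies are hypothetical.

```python
def resolve_allele_string(candidates, population_freq, threshold=0.03):
    """Return the 4-digit alleles retained for one ambiguous typing string.

    candidates: possible 4-digit alleles reported for one 2-digit typing.
    population_freq: allele -> frequency in the reference population.
    """
    known = {a: population_freq[a] for a in candidates if a in population_freq}
    if not known:
        # No candidate appears in the population study: retain all of them.
        return sorted(candidates)
    frequent = [a for a, f in known.items() if f > threshold]
    if len(frequent) > 1:
        # Several candidates are common (>3%): retain all the frequent ones.
        return sorted(frequent)
    # Otherwise retain the single most frequent candidate.
    return [max(known, key=known.get)]

# Hypothetical frequencies, for illustration only.
freq = {"A*02:01": 0.11, "A*02:06": 0.09, "A*02:07": 0.02, "A*24:02": 0.36}
```

With these made-up frequencies, a string containing A*02:01, A*02:06 and A*02:07 resolves to the two alleles above 3%, whereas a string containing A*24:02 and an unlisted subtype resolves to A*24:02 alone.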
Epitope binding prediction
We used the Metaserver algorithm (detailed in[21]) to predict HLA class I epitopes. Metaserver combines predicted TAP transport, proteasomal cleavage and HLA–peptide binding from NetCTL and NetMHC to predict peptide binding to HLA Class I A and B alleles present in the Kagoshima and Kumamoto cohorts. For each HLA class I A or B allele, all HTLV-1 peptides were ranked by binding score. The rank of the top HBZ peptide was recorded for each HLA class I allele. The rank of the top binding peptide for a protein is a more robust measure for comparisons between alleles than affinity[21]. HLA class I alleles where the rank of the top HBZ peptide was < 20 were included as predicted binders of HBZ. A0201 was also included as a binder (HBZ top rank = 29) as it had been experimentally shown to present HBZ peptides[34]. For alleles not included in Metaserver, we assessed the rank of the binding score using NetMHC only. Alleles where the rank of the top HBZ peptide was < 20 were added to the predicted binders of HBZ. Approximately 30% of individuals have a detectable CTL response to HBZ[35]. Individuals with two or more HBZ-binding alleles constituted 39% of the combined cohorts, and were designated ‘strong HBZ binders’, representing individuals who were most likely to be able to mount an HBZ-specific response.
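The classification of individuals by predicted HBZ binding can be sketched as follows. This is a minimal illustration of the stated thresholds, assuming that each allele copy in a genotype is counted separately (the text does not say how homozygous alleles were handled). Apart from the A0201 rank of 29, the ranks below are hypothetical, as are the function names.

```python
RANK_CUTOFF = 20                  # top HBZ peptide rank must be below this
EXPERIMENTAL_BINDERS = {"A0201"}  # included on experimental evidence

def is_hbz_binder(allele, top_hbz_rank):
    """True if the allele is a predicted (or experimentally shown) HBZ binder."""
    return allele in EXPERIMENTAL_BINDERS or top_hbz_rank[allele] < RANK_CUTOFF

def count_binding_alleles(genotype, top_hbz_rank):
    """Number of HBZ-binding HLA-A and HLA-B alleles carried by an individual."""
    return sum(1 for allele in genotype if is_hbz_binder(allele, top_hbz_rank))

def is_strong_binder(genotype, top_hbz_rank):
    """'Strong HBZ binder': two or more HBZ-binding alleles."""
    return count_binding_alleles(genotype, top_hbz_rank) >= 2

# Rank of the best HBZ peptide per allele (hypothetical except A0201 = 29).
ranks = {"A0201": 29, "A2402": 50, "B4002": 5, "B5201": 120}
```

In this example an individual typed A0201, A2402, B4002, B5201 carries two binding alleles and would be classed as a strong HBZ binder.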
Integration site analysis was carried out on samples from patients with HAM/TSP and ACs from the Kagoshima cohort where sufficient genomic DNA was available (HAM/TSP, n = 104; AC, n = 100). To maximise statistical power and to test for reproducibility, we also analysed DNA of ACs from a neighbouring prefecture in Japan, Kumamoto (n = 98).
HTLV-1 integration sites were mapped and their abundance quantified as previously described ([23,24]). Genomic DNA from peripheral blood mononuclear cells was randomly sheared by sonication and ligated to a partially double-stranded DNA adaptor that incorporated a 6 nt barcode, a reverse sequencing primer site and the P7 sequence for paired-end sequencing. Two rounds of nested PCR were performed between the HTLV-1 LTR and the adaptor, adding a paired end P5 sequence in the LTR primer. The resulting amplicons were combined into libraries of up to 42 samples and sequence data were acquired on an Illumina GAII or HiSeq platform with paired-end 50 bp reads and a 6 bp index (barcode) read. Paired reads were aligned to a human genome reference (build 18, excluding haplotype and “random” chromosomes) using ELAND. A random set of integration sites was derived from ~190,000 50 bp human genome sequences generated using Galaxy, and aligned to the same human genome reference to control for any bias due to alignment limitations.
Unique integration sites (defined by Read1) were quantified on the basis of the number of distinct shear sites identified (determined from paired Read2) and calibrated to provide a count of the number of sequenced sister cells per clone. The absolute abundance (A_abs) of a T cell clonal population, defined by a single HTLV-1 integration site, was calculated as the number of cells of that clone per 10^4 PBMC:
\[ A_{abs,i} = \frac{S_i}{\sum_{j=1}^{T} S_j} \times PVL \times 100 \]
where S_i is the number of sister cells of the i-th clone, T is the total number of observed clones in the patient, and PVL is the number of infected cells per 100 PBMC in the patient; the factor of 100 converts from per 100 PBMC to per 10^4 PBMC. Clones were assigned to absolute abundance ranges of <1 per 10^4 PBMC, 1-10 per 10^4 PBMC and >10 per 10^4 PBMC. LM-PCR integration site data sets for each patient were subjected to successive quality control filters as previously described ([23,24]); LM-PCR was designated ‘successful’ and the data included in further analysis only when the sample contained a minimum of 50 sister cells, a minimum of 15 clones, and an average of more than 15 sequence reads per sister cell.
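The abundance calculation, binning and quality-control thresholds can be sketched as below. This is an illustration only: it assumes that absolute abundance is the clone's share of all sequenced sister cells multiplied by PVL and rescaled from per 100 to per 10^4 PBMC, and that a value of exactly 10 falls in the middle bin. The function names are hypothetical.

```python
def absolute_abundance(sister_counts, pvl):
    """Abundance of each clone per 10^4 PBMC.

    sister_counts: sister cells observed per clone (one entry per clone).
    pvl: infected cells per 100 PBMC; the factor 100 rescales to per 10^4.
    """
    total = sum(sister_counts)
    return [s / total * pvl * 100.0 for s in sister_counts]

def abundance_bin(a_abs):
    """Assign a clone to one of the three absolute abundance ranges."""
    if a_abs < 1:
        return "<1"
    if a_abs <= 10:
        return "1-10"
    return ">10"

def passes_qc(sister_counts, total_reads):
    """LM-PCR success criteria stated in the text."""
    total_sisters = sum(sister_counts)
    return (
        total_sisters >= 50
        and len(sister_counts) >= 15
        and total_reads / total_sisters > 15
    )
```

For example, three clones with 50, 30 and 20 sister cells in a patient with a PVL of 2% would have abundances of 100, 60 and 40 cells per 10^4 PBMC, all in the highest bin.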
Oligoclonality index
To measure the diversity of clone abundance in the infected cell population from each individual, we used the oligoclonality index[24], which is based on the Gini Index. This index measures the non-uniformity of the distribution of clone abundance: a value of 0 indicates that all clones have the same abundance and 1 is an upper bound where the proviral load effectively consists of a single clone.
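The behaviour of a Gini-type index at its two extremes can be illustrated with the standard Gini coefficient. This sketch is not the published oligoclonality index, which may include corrections not shown here; the function name is hypothetical.

```python
def gini_index(abundances):
    """Standard Gini coefficient of a list of non-negative clone abundances.

    Returns 0 when all clones are equally abundant and approaches 1
    (exactly 1 - 1/n for n clones) when one clone accounts for everything.
    """
    xs = sorted(abundances)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one clone with non-zero abundance")
    # Sum of rank-weighted abundances, ranks starting at 1 for the smallest.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n
```

Four equally abundant clones give a value of 0, whereas four clones of which only one is detected give 0.75, the maximum attainable with n = 4.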
Estimating total number of clones in the blood
Estimation of total numbers of clones in the blood (observed and unobserved) was carried out using a computational diversity estimation approach (DivE), as described previously[25]. Briefly, many mathematical models were fitted to species-accumulation data, and to successively smaller nested subsamples thereof. Novel criteria were used to score models in how consistently they can reproduce existing observations from incomplete data. The estimates from the best performing models were aggregated (using the geometric mean) to estimate the number of clones in the circulation.
Genetic and epigenetic analysis of integration site
Analysis of the genomic region surrounding the integration site was carried out as previously described ([23,24]). Specifically, the following attributes were identified for each integration site: location within/outside a transcriptional unit and orientation of the provirus versus that gene, proximity to a CpG island, and counts of selected histone marks within a 10 Kb window around the integration site. Locations of transcriptional units were retrieved from the NCBI (http://ftp.ncbi.nih.gov/gene/), CpG island data from UCSC tables[36], and histone marks from ChIP-seq experiments on primary CD4+ T cells (detailed in[24]). Integration site positions were compared to the locations of specific relevant annotations using the R package hiAnnotator (http://malnirav.github.com/hiAnnotator), kindly provided by N. Malani and F. Bushman (University of Pennsylvania, USA). An integration site was designated as being in an area enriched in a particular histone mark if there were more counts of the mark within a window of 10 Kb around the integration site than around 90% of random sites.
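The enrichment rule in the last sentence can be sketched as a comparison against the in silico random sites. This is an illustration under one assumption: a site counts as enriched when its mark count is strictly greater than the counts at 90% or more of the random sites (the handling of ties is not stated in the text). The function name and the counts are hypothetical.

```python
def is_enriched(site_count, random_counts, fraction=0.9):
    """True if the site has more counts of a mark than `fraction` of random sites.

    site_count: number of a histone mark within 10 Kb of the integration site.
    random_counts: the same count for each in silico random site.
    """
    below = sum(1 for c in random_counts if c < site_count)
    return below / len(random_counts) >= fraction

# Hypothetical counts around ten random sites.
random_counts = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

With these counts, a site with 9 marks exceeds 9 of the 10 random sites and is designated enriched, whereas a site with 8 marks is not.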
Statistical analysis
Statistical tests were performed using R version 2.15 (http://www.R-project.org/). A multivariable linear regression model was used to compare log PVL with the number of HBZ-binding HLA alleles per individual, taking into account disease status, as well as age and sex in the Kagoshima cohort-only analysis as these have been suggested to vary with PVL. Differences in OCI, PVL and total number of sisters between cohorts were analysed by non-parametric Mann-Whitney U tests as distributions were non-normal. Spearman correlation was used to determine the correlation in each cohort between log PVL and either log estimated number of clones in the blood or OCI. A likelihood ratio test was used to compare a null multivariable linear regression model associating log clone number with log PVL, disease status and HBZ binding capacity to an alternative model which added an interaction term between PVL and HBZ binding capacity. This allowed a test of whether the association between total clones in the blood and PVL differed with HBZ binding capacity. A second likelihood ratio test compared the same null model with a different alternative model including an interaction term between log PVL and disease status. This method was repeated for the association of OCI with log PVL.
In the genomic environment analysis, integration sites were grouped by absolute abundance range, and by the prefecture, disease status and HBZ binding status of the patient. For integration site data statistical analysis, the two AC cohorts were combined into a single AC group as they showed very similar integration site characteristics. Within subsets, the proportion of integration sites located within genes was plotted as odds ratio compared to random integration sites. Chi-squared tests were used to compare the total numbers of unique integration sites (UIS) lying inside or outside genes at each absolute abundance bin level. A chi-squared test for trend was used to measure the significance of a trend within a cohort across bins of increasing abundance. We plotted the mean number of a specified epigenetic mark within a 10 Kb window around integration sites within a group (N) divided by the mean number of that mark within 10 Kb of in silico random sites (N random). Mann-Whitney U tests were used to compare the numbers of specified epigenetic marks near integration sites from ACs versus HAM/TSP patients. Spearman correlation was used to test the association between log absolute abundance of an integration site clone and the frequency of a specified epigenetic mark within 10 Kb of the integration site.
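The odds ratio against random sites and the 2 × 2 chi-squared comparison can be sketched as follows. The analyses in this study were run in R; this Python sketch uses the Pearson statistic without a continuity correction, which is an assumption, since the text does not say whether a correction was applied. The counts and function names are hypothetical.

```python
import math

def odds_ratio(in_gene, out_gene, rand_in, rand_out):
    """Odds of integration in a gene relative to random integration sites."""
    return (in_gene / out_gene) / (rand_in / rand_out)

def chi_squared_2x2(a, b, c, d):
    """Pearson chi-squared test for the table [[a, b], [c, d]].

    Rows are two groups of sites; columns are inside / outside genes.
    Returns (statistic, p-value) with 1 degree of freedom and no
    continuity correction.
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-squared with 1 d.f. is erfc(sqrt(x / 2)).
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p

# Hypothetical counts: 60 of 100 observed sites in genes vs 50 of 100 random.
chi2, p = chi_squared_2x2(60, 40, 50, 50)
```

With these made-up counts the odds ratio relative to random sites is 1.5, and the test compares the two rows of the table directly.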
Correction for multiple comparisons was made using a Bonferroni-Holm correction to control the family-wise error rate for each set of tests (Mann-Whitney, Spearman) carried out across all analysed epigenetic marks and integration site subsets. Two epigenetic marks (H3K9ac, H3K4ac) were analysed but not reported as their results were very similar to reported ones; the p values from their analyses were included in the calculation of the Bonferroni-Holm correction. Correction for multiple comparisons was also made across subsets in the analyses (chi-squared test, chi-squared test for trend) of integration sites in genes.
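The Bonferroni-Holm step-down procedure can be written in a few lines. This is a generic sketch of the method, not the authors' script; the function name and the example p values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p values, returned in the input order.

    The smallest p value is multiplied by m, the next by m - 1, and so on;
    adjusted values are capped at 1 and forced to be non-decreasing.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        value = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, value)
        adjusted[idx] = running_max
    return adjusted

adjusted = holm_adjust([0.01, 0.04, 0.03])
```

A test is then called significant at family-wise level α when its adjusted p value is below α. Including the p values for unreported marks, as described above, increases m and therefore makes the correction more conservative.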
A multivariable logistic regression model was used to test whether disease status was independently associated with integration site in a gene, active genomic region and inhibitory genomic region at the integration site level (all integration sites used in analysis) or patient level (characteristics of integration sites averaged per patient). Host HBZ binding status and PVL, as well as log absolute clone abundance were controlled as additional factors in the model as these may vary with disease status and integration site environment. A multivariable logistic regression model was also used to test the association of integration in the same orientation as a gene or TSS with disease status (controlling for HBZ binding status, PVL and clone abundance). A further multivariable logistic regression tested the association of host HBZ binding status with integration in a gene (controlling for disease status and clone abundance) at the patient and integration site level.
Acknowledgements
We thank Laurence Game, Nathalie Lambie and Adam Giess of the Genomics Laboratory at the MRC Clinical Sciences Centre, Hammersmith Hospital, London, UK for high throughput sequencing and alignment; Paul Brookes and Corinna Steggar of Histocompatibility and Immunogenetics in Clinical Immunology, Hammersmith Hospital, London, UK for HLA typing of the Kumamoto cohort; Nirav Malani and Frederic D. Bushman of the Department of Microbiology, University of Pennsylvania, Philadelphia, PA, USA for the list of random integration sites and the hiAnnotator package; Mitsuhiro Osame and Koichiro Usuku for collection of the Kagoshima cohort; and Lucy Cook and Aileen Rowan in the Section of Immunology for discussion and comments.
This work was funded by the Wellcome Trust (UK).
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HAN designed and performed the experiments, analysed the data and wrote the manuscript. DJL analysed the total clone number data. AM contributed integration site analysis tools and participated in data analysis. BA and ME carried out epitope binding analysis and statistical analysis. MM participated in the design of the study and collected clinical samples. CRMB conceived the study, designed experiments and wrote the manuscript. All authors read and approved the final manuscript.