Background
HIV remains a major global public health issue with an estimated 36.7 million people living with HIV (PLWH) by the end of 2016 [
1]. Since the late 1990s, the progressive availability and success of combination antiretroviral therapy has reduced the risk of opportunistic infections and malignancies in PLWH, remarkably decreasing morbidity and mortality [
1]. Global efforts to strengthen HIV treatment programmes have not only transformed HIV to a manageable lifelong disease, but also constitute the most effective strategy for preventing onward transmission of infection and, thus, expansion of the epidemic [
2,
3]. Nonetheless, the annual number of new HIV infections remains high, with 1.8 million new infections in 2016, and the pace of decline is far too slow to reach global targets [
1,
4,
5]. Thus, global HIV prevention and treatment programmes must be guided by information on the sources of new infections and factors driving epidemic maintenance and growth.
The study of the HIV epidemic by molecular phylogenetics has been revolutionised by tools to assess the structure and dispersal of mainly local or regional epidemics [
6‐
8]. When viruses retain a high degree of genetic similarity relative to others, one can assume that their corresponding hosts are related by one or more recent transmission events. HIV-1 is well suited for these analyses because of its high nucleotide substitution rate, which allows the observation of evolutionary changes over a short time period [
9,
10]. Clustered sequences can infer putative transmission networks, and phylogenetic cluster analysis, combined with epidemiological and demographic data, can help identify the factors underlying the growth of both regional and global epidemics [
11‐
13]. Therefore, large-scale analyses of HIV-1 phylogenies to extract meaningful epidemiological information for evolutionary relationships and transmission history are feasible [
2,
3]. Such studies are important to identify the transmission of drug-resistant variants and to design prevention programmes and public health intervention strategies [
2,
3,
13‐
15].
In this study, we use a large HIV-1 sequence dataset of HIV cohorts from nine European countries and one from Canada to undertake molecular phylogenetic analyses to identify and characterise molecular transmission clusters (MTCs). We also examine the likely impact of clinical and demographic factors on regional phylogenetic clustering.
Methods
Patient data
As part of the EuroCoord collaboration [
16], HIV-1 sequence data linked to epidemiological and clinical data were available for 9265 of approximately 32,000 individuals enrolled by September 2014 into one of 10 cohorts from France, Germany, Greece, Italy, the Netherlands, Norway, UK, Austria, Spain and Canada. A subset of these data was from individuals with well-estimated HIV seroconversion dates (thereafter termed ‘seroconverters’) from the CASCADE (Concerted Action on SeroConversion to AIDS and Death in Europe) collaboration database.
All patients enrolled into the study gave their written informed consent.
HIV-1 sequences dataset
A pooled initial dataset of 18,655 HIV-1 sequences were available, including protease and partial reverse transcriptase (RT) sequences, alone or combined, and some integrase sequences. These were merged into a dataset of 8955 partial
pol sequences (i.e., protease and partial RT). Duplicates were excluded using the online tool ElimDupes [
17], resulting in one sequence per individual. All study sequences were generated as part of routine clinical resistance testing at the participating sites using standard (Sanger) sequencing procedures.
HIV-1 subtyping and reference datasets
Subtyping was performed using the online automated subtyping tools COMET (COntext-based Modeling for Expeditious Typing) [
18] and REGAv.2.0 [
19]. Un-subtyped and undetermined sequences were phylogenetically subtyped as previously described [
20].
MTCs were identified using a large sample of subtype-specific reference sequences from the Los Alamos HIV-1 sequence database [
21] in separate subtype-specific alignments as explained below. Analyses were conducted only for the most prevalent subtypes, i.e. A–D, F and G, and the circulating recombinant forms (CRF) CRF01_AE and CRF02_AG; other subtypes with low proportions in the study dataset (< 0.6%) were not analysed further. Reference datasets for all non-B subtypes, CRF01_AE and CRF02_AG included all
pol sequences (protease and partial RT) that were publicly available at the time of analysis. The number of reference sequences used per subtype was A, 3782; C, 6581; D, 1216; F, 837; G, 1026; CRF01_AE, 2696; and CRF02_AG, 2622. Given the large numbers of subtype B in the HIV Los Alamos database, a final reference dataset of 14,946 out of 42,470 (34.1%) available sequences randomly resampled from different geographic areas and sampling dates was used. All duplicate sequences were excluded prior to analysis.
Study sequences and subtype-specific reference sequences for each subtype and CRF were aligned separately using the MUSCLE programme in subtype-specific alignments [
22]. Alignments were manually trimmed using MEGA 6.0 [
23] and mutation sites described in the International Antiviral Society of the USA’s (IAS-USA) 2017 published list of Drug Resistance Mutations in HIV-1 [
24] were excluded from all datasets prior to any analyses.
Identification of molecular transmission clusters
A two-step analysis approach was followed. Initially, maximum likelihood (ML) phylogenetic inference and bootstrap analysis, as implemented in the RAxML-HCP2 tool, was performed [
25]. ML phylogenies were estimated using the general time-reversible substitution model with gamma rate heterogeneity among sites. MTCs were defined as those clusters with ≥ 2 sequences from the same country having bootstrap support greater than 75% (phylogenetic confidence criterion) and those consisting of sequences from a specific area at a proportion greater than 75% (geographic criterion) compared to the total number of sequences within the cluster. Subsequently, an additional confirmatory analysis was performed for the clusters that initially received lower bootstrap support values, namely those between 50% and 75%. Briefly, the consensus sequence for each cluster was estimated, then, using BLAST [
26], the 100 most relevant sequences to the consensus were downloaded and used for the confirmatory analysis. Phylogenetic analysis was performed using the Bayesian method with the general time-reversible substitution model with Γ-distributed rate, as implemented in MrBayes 3.2.2 [
27]. The confirmatory analysis was performed on a subset of clusters, namely those comprising ≥ 5 sequences fulfilling the geographic criterion, receiving support between 50% and 75%. The Markov chain Monte Carlo method was run for 2.2x10
6 generations (burnin was set to 2x10
5 generations; 10%), with four chains per run. This was sampled every 1000 steps and was checked for convergence, as previously described [
28].
Statistical analysis
Demographic and clinical data are summarised using median and interquartile ranges (for continuous variables), or absolute and relative frequencies (for categorical variables). Simple comparisons of the relevant distributions across different levels of other categorical variables are based on chi-square tests for categorical variables, or non-parametric (Mann–Whitney, Kruskal–Wallis) tests. Associations of the probability of belonging to an MTC with various demographic and clinical characteristics (sex, age, mode of transmission, sampling date, subtype, ethnic group, antiretroviral therapy (ART) experience, country, known seroconversion) were investigated using logistic regression models. All variables were used as categorical variable, except for age, which was used as a continuous variable because its effects did not deviate significantly from linearity. As a sensitivity analysis, the final multivariable logistic regression model was also fitted to subsets of the full dataset, excluding data from each of the three smallest cohorts (the Netherlands, Greece, and France), or all of them simultaneously.
Discussion
Phylogenetic analyses of ~9000 HIV-1 sequences revealed that >40% of them belonged in MTCs. While this observation is consistent with other reports of HIV-1 epidemic dispersal in these countries [
29‐
34], our study is among the first to investigate the structure of these regional HIV-1 phylogenies in greater detail, using a large-scale sequence dataset, dense reference sequence sampling and associating multiple clinical and demographic factors with the dispersal of MTCs.
An additional strength of this study is that all the available sequences of non-B and CRF subtypes deposited in the HIV Los Alamos database were used as reference sequences for phylogenetic analysis. For subtype B, we used more than one-third of the publicly available references sequences (14,946 of 42,470; 34.1%) after random selection representative of the global subtype B epidemic. Finally, MTCs were identified as those clustered sequences fulfilling both phylogenetic (bootstrap value > 75% or posterior probability support > 0.95) and geographic criteria (75% of clustered sequences from the same region). To date, there is no consensus on the methodology used to infer HIV-1 transmission clusters [
35]. In our study we used both geographic and phylogenetic criteria and a large number of globally sampled reference sequences to identify MTCs.
Not surprisingly for these 10 countries, subtype B was the most prevalent subtype in this dataset (84.3%), followed by subtypes C (4.8%), CRF02_AG (3.5%), A (2.9%) and CRF01_AE (2.1%), which is consistent with previously reported data [
29,
36,
37]. Notably, the probability of clustering in an MTC was significantly higher among subtype B than non-B sequences (ORs, CRF02_AG = 0.70, A = 0.65, C = 0.51 and CRF01_AE = 0.36; range of
P-values 0.001–0.016) (Table
3). Some studies have noted differences in the biological properties of HIV-1 subtypes [
38,
39], but there is no conclusive evidence that certain subtypes are more infectious or have higher transmissibility than others. This is most likely because of the high prevalence of subtype B infections in individuals enrolled in the study cohorts versus non-B subtypes and recombinants, rather than differences in the transmissibility and infectivity of subtype B viruses. It was the subtype B form of HIV-1 that was introduced to Western Europe and this remains the most prevalent subtype across Europe [
29,
36]. However, infections with non-B subtypes are more common among individuals from highly endemic areas, with sex between men and women being the predominant HIV risk factor. The only exceptions in Western Europe are Greece and Portugal, where subtypes G and A have spread successfully among the local populations [
29,
40]. Given the characteristics of the spread of these HIV-1 subtypes throughout Western Europe, the finding that subtype B infections have a higher probability of belonging to an MTC reflects that local populations are more likely to be infected within their country (e.g., through regional networks). This hypothesis is further supported by the differences across ethnic groups. In all comparisons, samples from people of white ethnicity were much more likely to contain sequences belonging to MTCs than others (P < 0.001 in all cases). These findings suggest that differences in the probability of belonging to an MTC are likely to be associated with the fact that residents of each country are more closely linked with each, rather than the fact that they are infected with subtype B per se. In other words, if another subtype, such as C, was dominant in Europe, we would probably observe a similar pattern, but with subtype C rather than B. To date, non-B infections in Western Europe (except for Greece and Portugal) are detected either as single lineages – not grouped with others from the same area, or forming small clusters of few sequences [
29,
41]. Our study highlights that non-B subtypes have not been associated with widespread epidemics in Europe, but in some countries there is some evidence for regional expansion [
20,
41,
42].
The subtype B epidemic was first described in the MSM population, but was spread among PWID soon afterwards [
43]. We also found that the MSM population was more likely to belong to MTCs than heterosexuals, PWID and haemophiliacs, suggesting that the MSM population has a greater chance of transmitting HIV between their members (Table
3). Others have also confirmed this trend [
13,
44]. With respect to our findings, there may be a higher prevalence of HIV in this group, a higher probability of HIV transmission through MSM practices or more risky behaviour [
13,
44]. The probability of clustering was also higher among younger and ART-naïve individuals, reflecting that the younger age group may engage in more risky behaviour and has higher HIV-RNA levels [
11].
Finally, the probability of belonging to an MTC differed by cohort country, with higher probabilities observed in Germany and Canada, followed by Spain (Table
3). Since nearly 50% of study sequences were from the three countries with the highest probabilities (namely Spain, UK and Germany), these observed higher probabilities might be explained by the regional expansion of local epidemics [
20,
30,
34].
There are several limitations to this study, as in all molecular epidemiological studies. Firstly, the findings may be distorted by the sampling method used. For instance, in all cohorts, there were more sequences available with more recent sampling dates. Significantly reduced sampling from Greece, France and the Netherlands may have biased our results. To minimise the effect of bias, we used a) highly homogenous inclusion criteria; b) a large-scale sequence cohort dataset and c) a large number of reference sequences (>34% of all available for subtype B and 100% for all other subtypes and CRFs analysed) to infer the fine structure of regional epidemics and dispersal networks. Furthermore, clustering definition of sequences uses both phylogenetic and geographic criteria, enabling higher sensitivity for the identification of MTCs. Although we used stricter definitions for networks, the current definition remains credible because it has been confirmed by Bayesian analysis [
28,
45,
46]. Finally, to avoid sampling bias – especially given the lower numbers of sequences from the Greek, French and Dutch cohorts – we repeated the multivariable analysis after excluding participants belonging to one of these three small cohorts. Results of this repeated analysis yielded estimates with negligible differences compared with the main analysis.
We found that sequences from samples from individuals with well-estimated seroconversion dates and more recent sampling dates had a higher probability of belonging to MTCs in the specific regional cohorts. Given improvements in the depth of sampling and efficiency of sequencing, larger and more complete HIV-1 sequence datasets are now available. This suggests that some of the increase in the regional MTCs might, at least in part, be attributed to better capturing of recent transmission events. This is in line with previous findings, in which recently infected patients were found to be crucial in the spread of the HIV epidemic [
8,
11]. Thus, prevention measures should specifically target these newer MTCs of specific risk groups. The public health implications of such findings, including treatment strategies, are of special interest.