Background
Malaria is one of the most widespread parasitic infections affecting humans, and World Health Organization malaria elimination programmes are threatened by the emergence of drug resistance. Cambodia is considered to be the epicentre of
Plasmodium falciparum resistance to the first-line treatment, artemisinin-based combination therapy [
1]. The emergence and existence of artemisinin resistant parasites have now been reported at different locations in the Greater Mekong Subregion (GMS) [
2‐
10]. Understanding the mechanism underlying resistance is crucial for identifying new drug targets and restricting the spread of artemisinin resistance to high malaria transmission areas such as Africa.
The fragmentation of Cambodian
P. falciparum population into artemisinin resistant (founder subpopulations), sensitive (ancestral population) and admixed subpopulations was shown in recent reports [
11‐
13]. Furthermore, four non-synonymous mutations (C580Y, Y493H, R539T and I543T) in the propeller domain of the Kelch gene
k13 (
PF3D7_1343700), located on chromosome 13, were described as determinants of clinical artemisinin resistance in Cambodia [
14]. Nine other non-synonymous mutations in the
k13 were associated with clinical artemisinin resistance in GMS [
2‐
10]. The
k13 mutant alleles exhibiting clinical artemisinin resistance have not been observed outside the GMS [
10]. Some of the
k13 artemisinin resistant alleles have emerged independently, with different geographical origins, at different locations in Southeast Asia (SEA) [
4,
8] and have spread to neighboring countries [
15]. This could be due to different selective pressure at different locations [
10]. C580Y is observed to be the most dominant and transmissible allele reaching fixation in Cambodia, at the Thailand–Myanmar border, in Vietnam and in southern Laos, with a single origin in western Cambodia [
8‐
10,
15,
16]. Certain essential genetic background mutations could facilitate artemisinin resistance through selection, a phenomenon similar to that observed with resistance to chloroquine, pyrimethamine or sulfadoxine in the 1960s in the GMS [
1,
17]. Emergence was also suspected to be related to the small and well-structured parasite population in the GMS [
18]. The relatively low transmission, which prevents the development of protective immunity in the human population, may have helped the emergence of drug resistance, especially at the border between Cambodia and Thailand [
19]. Another reason for the emergence of resistance could be the early introduction of anti-malarial drug-based treatments, with inappropriate doses or usage that may have led to continual drug exposure of malaria parasite populations [
20,
21].
Based on genome wide analysis, four genes
Pffd,
Pfcrt,
Pfarps10 and
Pfmdr2, have been described as background genes supporting
k13 resistant alleles [
4]. Although these markers certainly determine artemisinin resistance, the underlying resistance mechanism is still unknown. Mbengue and colleagues have described a K13-PI3K pathway as potentially associated with artemisinin resistance, and suggested that PI3K could be targeted to understand the resistance mechanism [
22]. In addition, biological processes and pathways such as ubiquitination, the oxidative stress response and the unfolded protein response are associated with artemisinin resistance [
23,
24]. The present study aims to associate the biological processes and metabolic pathways with the early
P. falciparum population structure in Cambodia, by analysing mutations in parasites isolated in Cambodia during the early period of emergence of artemisinin resistance (2008–2011). Population structure, genetic variations, networks and annotation data were integrated using a theory described here as the emergence-selection-diffusion (ESD) model. Bioinformatic results on significant mutations and related biological pathways were used to test the model and evaluate the process of emergence and selection in specific small subpopulations, followed by diffusion in an admixed population. This model, defined for a descriptive approach, is similar to the population shifting balance and metapopulation theories in population genetics that were successfully applied to the analysis of
P. falciparum populations [
25,
26].
In total, 21,257 non-synonymous SNPs were identified using specific filters among 167 isolates from the genomic dataset on which the parasite population structure was described earlier [
12]. Based on hierarchical clustering and network-based stratification [
27], eight specific subpopulations were identified among four regions in Cambodia. An extended list of 57 background genes associated with artemisinin resistant parasites in Cambodia is described. In addition to the identification of known targets, functional analysis reveals a strong interplay between ubiquitination and cell division in artemisinin resistant parasites.
Methods
Data acquisition
The genome sequencing data of Cambodian parasite isolates, submitted by the Wellcome Trust Sanger Institute, were recovered from the European Nucleotide Archive (ENA) database server. Out of the 293
P. falciparum whole genome sequences from 2008 to 2011, which were used to define the
P. falciparum parasite population structure in Cambodia [
11,
12], 167 sequences were recovered successfully (Additional file
1) in BAM format and converted to more readable VCF (v4.1) files [
28] using SAMtools v0.1.19 [
29].
Plasmodium falciparum 3D7 strain (genome sequence annotation version 2) was used as the reference genome, and was recovered from the PlasmoDB Plasmodium Genomics resource database (release version 5.5) [
30]. The 167 genome sequences originate from four locations: Pailin (14 sequences), Tasanh (26 sequences) and Pursat (81 sequences) in western Cambodia and Ratanakiri (46 sequences) in eastern Cambodia.
Filtering data for noise
A reliable variant calling pipeline was established for the identification of significant SNPs (Additional file
2). The analysis focused on non-synonymous SNPs in the coding region of the genome that occur in at least one of the isolates, and does not take insertions/deletions (INDELs) into consideration. The SNP data were filtered for noise based on the VCF (v4.1) file signal parameters. To identify thresholds on these parameters, around 100 kb was first removed from the start and end of each chromosome. These 100 kb regions at the extremities of
P. falciparum chromosomes encode multi-gene families, most of them encoding putative surface antigens (including VAR, Rifin, Stevor). An average value for each signal parameter over the 167 isolates was calculated for each SNP position as the sum of the values in the different isolates divided by the number of isolates having a non-reference allele. The average quality values per position were plotted along the genome, and a clear threshold in the mapping quality (MQ) parameter was observed at a value of 29 (Additional file
3). Removing chromosome ends removed most of the signal below a score of 29. Therefore, a mapping quality score higher than 29 was taken as the first filtering criterion. Many SNPs showed MQ ≤ 29 in the coding core of chromosomes 4 and 7, as internal VAR gene clusters are present in these chromosomes. No other quality parameter plotted along the genome, with or without the chromosome end regions, showed a change in the distribution of SNPs on which a significant threshold could be defined. In order to select the SNPs with a high-quality non-reference (non-REF) signal, DP4 was analysed; this parameter gives the number of high-quality forward strand REF bases, reverse strand REF bases, forward strand non-REF bases and reverse strand non-REF bases at a specific position. To quantify the proportion of non-REF signal (referred to as DA in the manuscript) and choose a threshold, the distribution of the proportion of non-REF bases (forward strand non-REF bases + reverse strand non-REF bases) over the sum of all four DP4 counts was analysed. The value 0.7 was chosen, as it appeared to be the intersection of two distributions: low quality SNPs on the left and high quality SNPs on the right (Additional file
4). Therefore, a DA score ≥ 0.7 was taken as the second filtering criterion, to include SNPs with high quality non-REF allele calls. Around 20,000 SNPs were recovered for each isolate after applying these two filters (min: 13,470, max: 23,022 SNPs). There were 247,783 SNPs having a non-REF allele in at least one of the 167 Cambodian parasite genomes.
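The two filters described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names and the toy calls are hypothetical, and the sketch applies the thresholds to each call individually, which is an assumption, since the text does not say whether filtering was done per call or on per-position averages.

```python
def non_ref_fraction(dp4):
    """Return DA: share of high-quality non-REF bases among all four DP4 counts."""
    ref_fwd, ref_rev, alt_fwd, alt_rev = dp4
    total = ref_fwd + ref_rev + alt_fwd + alt_rev
    if total == 0:
        return 0.0  # no high-quality bases at all: treat as no non-REF signal
    return (alt_fwd + alt_rev) / total


def passes_filters(mq, dp4, mq_min=29, da_min=0.7):
    """Keep a call only if MQ > 29 and DA >= 0.7 (thresholds taken from the text)."""
    return mq > mq_min and non_ref_fraction(dp4) >= da_min


# Hypothetical calls, for illustration only.
calls = [
    {"pos": 1000, "mq": 45, "dp4": (1, 0, 12, 9)},   # strong non-REF signal
    {"pos": 2000, "mq": 29, "dp4": (0, 0, 10, 10)},  # MQ is not above 29
    {"pos": 3000, "mq": 50, "dp4": (6, 5, 5, 4)},    # mixed signal, DA = 0.45
]
kept = [c["pos"] for c in calls if passes_filters(c["mq"], c["dp4"])]
print(kept)
```

Only the first toy call survives both filters; the second fails on mapping quality and the third on the non-REF proportion.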
In order to remove the SNPs with rare allele frequencies in the population, minor allele frequency (MAF) for each SNP was analysed (Additional file
5). It is defined as the minimum of non-reference allele frequency (NRAF) and 1-NRAF [
11,
31]. All SNPs with a MAF below 0.01796 were removed from the analysis (Additional file
5). After applying the three filters, 111,701 unique SNPs were recovered in 167 Cambodian isolates.
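The MAF definition above, min(NRAF, 1 − NRAF), can be illustrated with a short Python sketch. The 0/1 genotype encoding, the SNP names and the toy data are hypothetical, and the sketch assumes every isolate has a call at every position (missing calls are not handled).

```python
MAF_MIN = 0.01796  # threshold stated in the text


def minor_allele_frequency(genotypes):
    """genotypes: one 0 (REF) or 1 (non-REF) call per isolate."""
    nraf = sum(genotypes) / len(genotypes)  # non-reference allele frequency
    return min(nraf, 1 - nraf)


# Toy SNPs over 167 isolates (hypothetical data).
snps = {
    "snp_a": [1] * 2 + [0] * 165,   # non-REF allele in 2 isolates
    "snp_b": [1] * 3 + [0] * 164,   # non-REF allele in 3 isolates
    "snp_c": [1] * 165 + [0] * 2,   # REF allele is the minor one (2 isolates)
}
kept = sorted(name for name, g in snps.items()
              if minor_allele_frequency(g) >= MAF_MIN)
print(kept)
```

Numerically, 3/167 ≈ 0.017964 lies just above the stated threshold, so in this toy setting a SNP is kept only when its minor allele is carried by at least three of the 167 isolates.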
Correspondence between different genome versions
The correspondence of SNP coordinates between P. falciparum genome sequence versions 2 and 3 was recovered using BLAST (Basic Local Alignment Search Tool) from NCBI (National Center for Biotechnology Information). BLAST was performed for each chromosome separately, with genome version 2 (PlasmoDB release version 5.5) as reference and version 3 (PlasmoDB release version 10) as query. For chromosomes 4, 7, 8, 10 and 13, only very short alignments were obtained, reflecting major changes in the genomic sequence between versions 2 and 3; correspondence for these chromosomes was then obtained by defining specific regions for BLAST. Correspondence was recovered for nearly all SNPs in all chromosomes, and the 3105 unmapped SNPs were removed from the analysis, leaving a list of 108,596 SNPs.
Removing uncertain SNPs and correcting errors
After filtering the data and removing the unmapped SNPs, the uncertainties were treated. SNPs with more than one ALT (non-REF) allele for a specific isolate were considered as uncertainties. SNPs with an uncertain ALT allele frequency higher than 40% were removed (830 SNPs). Where a SNP had only one ALT allele in most of the isolates and an uncertain ALT allele in some isolates, the uncertain allele was substituted with the ALT value (16,859 SNPs). Where a SNP had more than one ALT allele across different isolates, the uncertain ALT alleles were substituted with the most frequent ALT allele (1772 SNPs). An ALT allele with a frequency 1.5 times that expected for the same allele at random was considered the most frequent ALT allele for that SNP. Where the uncertain ALT allele frequency was less than 5% and there was no majority ALT allele, the uncertain ALT allele was substituted with the REF allele (54 SNPs). All other cases were removed from the analyses (1228 SNPs). A total of 106,538 unique SNPs were recovered in 167 Cambodian isolates, with 18,683 modified SNPs (Additional file
6).
Annotation of the recovered SNPs
To describe the distribution of the SNPs over the genome, the 167 strains were annotated using the VCF-annotator Perl script (developed at the Broad Institute, Cambridge, MA). The GFF3 (General Feature Format version 3) file was recovered from the Ensembl database server (release version ASM276v1.21) corresponding to the
P. falciparum genome sequence version 2. The annotation shows that the recovered SNPs are mostly in introns, exons and 5′ UTRs. To define the coordinates of the chromosome end regions, the chromosomes corresponding to genome version 3 (PlasmoDB release version 11) were visualized using the genome browser provided by PlasmoDB. The regions containing genes with descriptions such as CLAG, DBL, Rifin, hyp, Stevor, GARP, RESA, VAR, PfEMP, Surfin, PHIST, KAH and EMP in a consecutive organization were considered as chromosome end regions. The gene location in the coding region was determined and chromosome end coordinates were defined (Table
1).
Table 1
Chromosome end coordinates
Chromosome | Left end region (bp) | Right end region (bp) | Chromosome length (bp) |
Chromosome 1 | 1–117,000 | 481,500–640,851 | 640,851 |
Chromosome 2 | 1–120,000 | 783,000–947,102 | 947,102 |
Chromosome 3 | 1–135,000 | 1,002,000–1,067,971 | 1,067,971 |
Chromosome 4 | 1–174,000 | 1,067,000–1,200,490 | 1,200,490 |
Chromosome 5 | 1–49,000 | 1,297,000–1,343,557 | 1,343,557 |
Chromosome 6 | 1–74,000 | 1,293,000–1,418,242 | 1,418,242 |
Chromosome 7 | 1–91,000 | 1,320,000–1,445,207 | 1,445,207 |
Chromosome 8 | 1–90,000 | 1,296,000–1,472,805 | 1,472,805 |
Chromosome 9 | 1–127,000 | 1,380,000–1,541,735 | 1,541,735 |
Chromosome 10 | 1–112,000 | 1,515,000–1,687,656 | 1,687,656 |
Chromosome 11 | 1–138,000 | 1,934,000–2,038,340 | 2,038,340 |
Chromosome 12 | 1–98,000 | 2,130,000–2,271,494 | 2,271,494 |
Chromosome 13 | 1–129,000 | 2,808,000–2,925,236 | 2,925,236 |
Chromosome 14 | 1–71,000 | 3,129,000–3,291,936 | 3,291,936 |
The final set of 21,257 non-synonymous SNPs in the coding region was described, using the recovered annotation and chromosome ends region coordinates (Additional file
7). This dataset is referred to as the IBC dataset in the manuscript and is used for the parasite population study in Cambodia. Of these 21,257 SNPs, 3714 are modified SNPs (as described in the section above).
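As an illustration of how the chromosome end coordinates of Table 1 can be used to restrict SNPs to the core genome, the following Python sketch tests whether a position falls inside an end region. It is a sketch under stated assumptions: only three rows of Table 1 are copied, the coordinates are treated as 1-based and inclusive, the SNP positions are hypothetical, and positions are assumed to be expressed in genome version 3 coordinates.

```python
# Subset of Table 1: (left end region, right end region) per chromosome.
CHROMOSOME_ENDS = {
    1: ((1, 117_000), (481_500, 640_851)),
    2: ((1, 120_000), (783_000, 947_102)),
    13: ((1, 129_000), (2_808_000, 2_925_236)),
}


def in_chromosome_end(chrom, pos):
    """True if pos lies inside either end region of the given chromosome."""
    return any(start <= pos <= end for start, end in CHROMOSOME_ENDS[chrom])


# Hypothetical (chromosome, position) pairs.
snps = [(1, 50_000), (1, 300_000), (13, 1_725_000), (13, 2_900_000)]
core = [s for s in snps if not in_chromosome_end(*s)]
print(core)
```

Here the first and last toy positions fall in end regions and are dropped, while the two internal positions are retained.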
Clustering
To describe the
P. falciparum population structure in Cambodia, unsupervised hierarchical clustering was performed on the IBC dataset (all statistical analyses were performed in R v3.0.1 and v3.2.3). The pairwise distance between two isolates was estimated as the proportion of base substitutions between them over the whole set of recovered SNPs. Ward's minimum variance method was used as the linkage criterion to build the dendrogram. The correspondence between previously described parasite subpopulations in Cambodia [
12] and the 167 isolates was recovered from the Sanger Institute. Eight subpopulations were described based on the hierarchical clustering results: KH1.1, KH1.2, KH2.1, KH2.2, KH3, KH4, KH5 and KHA. In order to choose the optimal number of clusters in the dendrogram, k was varied from 2 to 10; the clusters obtained at k = 8 overlapped both the clusters based on different
k13 alleles and the previously described KH subpopulations. Increasing k further only divided the admixed subpopulation KHA into smaller subpopulations.
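The pairwise distance used for clustering, the proportion of SNP positions at which two isolates differ, can be sketched as follows. The isolate names and allele strings are hypothetical, each isolate is assumed to be encoded as one allele character per SNP in a shared order, and the Ward clustering step itself is not reproduced here.

```python
def pairwise_distance(a, b):
    """Proportion of SNP positions at which two isolates carry different alleles."""
    if len(a) != len(b):
        raise ValueError("isolates must be typed at the same SNP positions")
    differences = sum(1 for x, y in zip(a, b) if x != y)
    return differences / len(a)


def distance_matrix(isolates):
    """Return {(name_i, name_j): distance} for every ordered pair of isolates."""
    return {
        (p, q): pairwise_distance(isolates[p], isolates[q])
        for p in isolates
        for q in isolates
    }


# Hypothetical isolates typed at five SNP positions.
isolates = {"iso1": "AAGTC", "iso2": "AAGTT", "iso3": "TCGTT"}
d = distance_matrix(isolates)
print(d[("iso1", "iso2")], d[("iso1", "iso3")], d[("iso2", "iso3")])
```

A matrix of this kind is the sort of input a hierarchical clustering routine with Ward linkage (for instance `hclust` in R) would take to build the dendrogram.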
Significant SNPs and genes
To describe the metabolic pathways and functions associated with different subpopulations, significant genes were recovered based on significant SNPs. For each subpopulation, significant SNPs were defined using a one-tailed Fisher exact test, by comparing the ALT allele frequency of each SNP in that subpopulation to the ALT allele frequency in the artemisinin sensitive population KH1.1, which is considered as the ancestral population [
12]. Only SNPs whose ALT allele frequency was increased in a particular subpopulation were considered. The Benjamini–Hochberg method was used to correct the
p values from multiple comparisons. All the SNPs with corrected
p value lower than 0.05 were considered as significant and the genes containing these significant SNPs were defined as the significant genes (Table
2).
Table 2
Number of significant SNPs and genes in each subpopulation compared to the ancestral artemisinin sensitive subpopulation
Subpopulation | Number of isolates | Significant SNPs | Significant genes |
KH1.2 | 5 | 1361 | 823 |
KH2.1 | 11 | 1620 | 938 |
KH2.2 | 22 | 2312 | 1125 |
KH3 | 12 | 1495 | 859 |
KH4 | 9 | 1891 | 978 |
KH5 | 14 | 1612 | 900 |
KHA | 49 | 740 | 493 |
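The test procedure described in this section (a one-tailed Fisher exact test per SNP, followed by Benjamini–Hochberg correction and a 0.05 cut-off) can be sketched with the Python standard library alone. The group sizes, SNP names and ALT allele counts below are hypothetical, and the sketch assumes one allele call per isolate.

```python
from math import comb


def fisher_one_tailed(alt_sub, n_sub, alt_ref, n_ref):
    """Right-tailed Fisher exact p value: P(X >= alt_sub) under the hypergeometric null."""
    total_alt = alt_sub + alt_ref
    total = n_sub + n_ref
    upper = min(n_sub, total_alt)
    tail = sum(
        comb(total_alt, k) * comb(total - total_alt, n_sub - k)
        for k in range(alt_sub, upper + 1)
    )
    return tail / comb(total, n_sub)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts of isolates carrying the ALT allele:
# (count in a subpopulation of 10 isolates, count in a reference group of 40).
N_SUB, N_REF = 10, 40
alt_counts = {"snp_a": (9, 2), "snp_b": (3, 11), "snp_c": (6, 4)}

names = list(alt_counts)
raw = [fisher_one_tailed(alt_counts[n][0], N_SUB, alt_counts[n][1], N_REF)
       for n in names]
adjusted = benjamini_hochberg(raw)
significant = [n for n, q in zip(names, adjusted) if q < 0.05]
print(significant)
```

Because the test is right-tailed, only SNPs whose ALT allele is enriched in the subpopulation can reach significance, which matches the restriction to increased ALT allele frequency stated above.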
Gene interaction networks and gene ontology
In order to determine the biological processes associated with different subpopulations, gene–gene interaction networks were analysed. The interaction network data was recovered from STRING v10, which provides functional and predicted protein–protein interactions from other publicly available data sources and literature [
32]. For each subpopulation, a network based on co-expression data was recovered for its significant genes. Only edges with a STRING confidence score for co-expression higher than 0.5 were considered for the analysis. The networks were imported and analysed in Cytoscape v3.2.1 [
33].
The complete
P. falciparum interaction network based on co-expression data (confidence score ≥ 0.5) was also recovered from the STRING v10 database. Out of the 5777 genes identified in the genome of
P. falciparum (PlasmoDB release v24), only 3875 genes had a co-expression interaction confidence score greater than 0.5. Of these 3875 genes, 33 had no interactions with the major interaction network and were not considered in this analysis. The remaining 3842 genes were classified into six parasite intra-erythrocytic stage forms (early ring, late ring, early trophozoite, late trophozoite, early schizogony and late schizogony). These genes were classified according to the maximum expression stage data, which was based on microarray transcriptomic data of the study by Le Roch et al. [
34], available on the PlasmoDB server (“Pf-iRBC + Spz + Gam Max Exp Timing” column). The genes with maximum expression in the merozoite intra-erythrocytic stage of the parasite were not the focus of the analysis (711 genes). In addition, 138 genes were not classified into any blood stage according to the maximum expression stage data. The remaining 2993 genes were mainly distributed between the ring and trophozoite stages, followed by the schizont blood stage (Table
3).
Table 3
Number of genes in P. falciparum gene–gene interaction network with maximum expression in different parasite blood stage forms
Stage | Form | Number of genes | Stage total |
Ring | Early | 702 | 1043 |
Ring | Late | 341 | |
Trophozoite | Early | 548 | 1082 |
Trophozoite | Late | 534 | |
Schizogony | Early | 391 | 868 |
Schizogony | Late | 477 | |
Merozoite | | 711 | |
Unclassified | | 138 | |
Total | | 3842 | |
To identify the biological functions associated with the significant genes, the functionally grouped networks of GO terms and pathways were recovered and analysed for each subpopulation using the ClueGO v2.2.4 [
35] and CluePedia v1.2.4 [
36] plugins of Cytoscape. In the resulting network, nodes are terms and edges are associations based on the kappa score. The default settings of the ClueGO plugin were used. A right-sided hypergeometric test (enrichment) was used as the statistical test and the Benjamini–Hochberg method was used to correct
p values.
Network based subpopulation description
The population structure of the 167 parasite isolates was also examined using Network Based Stratification [
27], which clusters together isolates whose diffusion paths associated with mutated genes lie in similar network regions. The list of mutated genes for each isolate was projected on the full
P. falciparum interaction network recovered from STRING v10 [
32]. All prediction sources such as co-expression, co-occurrence, gene fusion, databases, experimental evidence, text-mining and neighbourhood were considered for the full
P. falciparum interaction network. The results were obtained for the top 10% gene–gene interactions based on combined confidence score provided by STRING database [
32]. The mutated genes were propagated to the neighbourhood network (network smoothing) and based on the node score matrix of the resulting diffusion network for each isolate, clustering was performed using non-negative matrix factorization (NMF) method [
37]. Consensus clustering was performed by randomly selecting 80% of the mutated genes and 80% of the isolates 100 times and iterating NMF clustering 10 times. The consensus between two samples was estimated as the percentage of clustering results in which they fall in the same cluster when the dendrogram is cut to obtain 8 groups (to ease comparison with the 8 KH subpopulations). This consensus clustering matrix was normalized and then used to build a dendrogram for all 167 isolates using a Euclidean distance matrix and Ward's minimum variance method in R.
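The co-clustering percentage underlying the consensus matrix can be illustrated with a small Python sketch. The run labels and sample names are hypothetical. Because each resampling run draws only a subset of isolates, the sketch divides by the number of runs in which both samples were drawn; this denominator is an assumption, as the text does not specify it.

```python
from itertools import combinations


def consensus_matrix(runs, samples):
    """runs: list of dicts mapping each sample drawn in that run to its cluster label.

    Returns {(a, b): fraction of runs containing both a and b in which they
    share a cluster}; pairs never drawn together get 0.0.
    """
    pairs = list(combinations(samples, 2))
    together = dict.fromkeys(pairs, 0)
    drawn = dict.fromkeys(pairs, 0)
    for labels in runs:
        for a, b in pairs:
            if a in labels and b in labels:
                drawn[(a, b)] += 1
                if labels[a] == labels[b]:
                    together[(a, b)] += 1
    return {p: (together[p] / drawn[p] if drawn[p] else 0.0) for p in pairs}


# Hypothetical cluster assignments from four resampling runs.
runs = [
    {"s1": 0, "s2": 0, "s3": 1},
    {"s1": 2, "s2": 2},
    {"s1": 0, "s2": 1, "s3": 0},
    {"s2": 1, "s3": 1},
]
consensus = consensus_matrix(runs, ["s1", "s2", "s3"])
print(consensus)
```

Cluster labels only need to be consistent within a run, since the comparison is made between two samples of the same run and never across runs.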
Some of the isolates were grouped in different clusters compared to the clustering results based on 21,257 SNPs. Ten isolates (isolate index: 9, 70, 91, 103, 112, 134, 148, 151, 160, 162) clustered in KHA based on the 21,257 SNPs, but clustered in other resistant and sensitive subpopulations according to network based stratification. In addition, one isolate (isolate index: 156) previously in KH2.1 was classified in KH2.2, and one isolate (isolate index: 61) previously in KH3 was classified in KH2.1. The other 155 isolates fell in the same clusters as observed previously.
Ribosome S10 protein structure prediction
Models were built using four independent structure prediction servers, IntFOLD [
38], Phyre2 [
39], RaptorX [
40] and LOMETS [
41]. The top model from each of the four servers was analysed using ModFOLDclust2 [
42] and manually inspected, showing that all models had the same fold. The RaptorX models were selected as the best representative models according to the ModFOLDclust2 score and manual inspection. Structural superposition of the RaptorX models was undertaken using the TM-align method [
43], which produces a TM-score between 0 and 1, with scores above 0.5 indicating the same fold and scores close to 1 indicating a high degree of structural similarity of the two proteins.
Conclusions
The present study considers an Emergence-Selection-Diffusion (ESD) model for the description of biological information related to mutations accumulating in parasites during the early period of emergence of artemisinin resistance (2008–2011). This descriptive study showed that random drift in subpopulations may help the emergence of mutations in specific biological pathways. The set of 168 genes was associated with cell signaling, gene expression, organelle functions and surface antigens such as msp1 and some var genes. The analysis allows the detection of common features between resistant subpopulations that result from the selection of drug resistant parasites. The 57 background genes found in the present study, associated with the selection of k13 mutants, encode proteins involved in ubiquitination and autophagy, and cell signaling enzymes such as ARK2, PI3K and PDE1, which are involved in cell division and differentiation. Relationships with other malaria drug resistance markers were also present in the subpopulations. Possible resistance to tetracycline emphasizes that the population structure in the GMS promotes rapid emergence of drug resistance, as hypothesized by the population shifting balance or metapopulation theories.
The
k13 alleles related to the early period of emergence of artemisinin resistance in Cambodia had still not diffused outside of the country, at least up to 2016 when the last large survey was performed [
5,
10]. Nevertheless, the present study emphasizes the important role in the diffusion of artemisinin resistance in Cambodia of KHA parasites, which are more prone to crossing with other parasites according to the ESD model and similar population genetic theories. The fear of diffusion of
k13 alleles outside the GMS is now real, as some are fixed in KHA parasites, but this will need to be confirmed experimentally. The functional analysis of genomic data also suggests that biological functions associated with the 57 background genes could be either the result of environmental pressure or an intrinsic consequence of genetic drift in parasites with low fitness. Several markers of resistance to known drugs were found in KHA samples. This population may play a central role in facilitating the emergence of resistant parasites under the pressure of anti-malarial drugs. The present study of the early emerging artemisinin resistant populations suggests that newly emerging parasites are likely to be resistant to several anti-malarial drugs. The fitness of such parasites is in question, but tracking the diffusion of these new genetic backgrounds outside the GMS should certainly be extended to additional markers other than the
k13 gene alone.
Authors’ contributions
Conceived and designed experiments: CR, DM, and EC. Retrieved and analysed sequence data: AD, MH, SM and ER. Generated structural and functional annotation: AD, NK, DBR and EC. Generated scripts for mathematical and statistical analysis and figures: AD, CR, AK, DBR, JC and EC. Interpreted the data and provided critical discussion: AD, CR, ER, DM, RF, CBM and EC. Wrote the manuscript: AD, RF, CBM, JC and EC. All authors edited the manuscript. All authors read and approved the final manuscript.