Background
The malaria parasite
Plasmodium falciparum is the most prevalent malaria species found on the African continent [
1] and is responsible for 90% of deaths from malaria [
2]. The NF54 isolate derives from an infection obtained near Schiphol Airport in the Netherlands [
3]. Two sibling parasites, 3D7 and E5, were independently isolated by limiting dilution from the original NF54 culture [
3,
4]. The 3D7 clone has been used in the malaria genome sequencing project [
5], revealing that the
P. falciparum genome consists of a 23 Mb nuclear genome with 14 chromosomes and around 5500 genes [
6]. E5 was incidentally identified during transfection experiments with the original NF54 culture [
4] and has previously been characterized by PCR cloning and gene specific PCR [
7]. Whole genome sequencing of
P. falciparum is complicated by the special properties of the
P. falciparum genome: it is very AT-rich and contains many repetitive regions and homopolymer runs, especially in its intergenic regions, which hinders the assembly of sequence data [
8‐
10]. It has recently been shown that the genome can be divided into a core genome (95%) and hypervariable regions (5%) [
10,
11]. Unambiguous alignments of sequence data of different strains are only possible in the core regions and not in the hypervariable regions that harbour the majority of the variant surface antigen (VSA) gene families [
12‐
14].
To date, five multicopy gene families that encode VSAs have been described in
P. falciparum:
stevor (subtelomeric variable open reading frame) [
15],
rif (repetitive interspersed family) [
16],
pfmc-
2tm (
P. falciparum Maurer’s clefts two transmembrane) [
17],
surfin (surface associated interspersed genes) [
18] and
var [
19]. The best investigated VSA is
P. falciparum erythrocyte membrane protein 1 (PfEMP1) [
6,
20,
21]. PfEMP1 is encoded by the multicopy
var gene family that consists of about 60
var (variability) genes per
P. falciparum genome [
19]. Antigenic variation is primarily mediated by mutually exclusive expression of 1 of the 60
var genes per infected red blood cell. The subtelomeric position of most
var genes [
14] predisposes them to recombination contributing to the diversity of PfEMP1 [
22]. PfEMP1 is transported to the surface of infected red blood cells, where it binds to surface receptors on host endothelial cells. This cytoadhesion prevents clearance of the infected red blood cells by the spleen. Different forms of PfEMP1 possess different binding specificities, and individual PfEMP1 variants have been associated with distinct malaria syndromes such as malaria in pregnancy or cerebral malaria [
23‐
28].
In endemic regions, antibodies to PfEMP1 develop early in life and have been shown to correlate with the development of protective immunity [
29]. To escape the human immune response,
P. falciparum can switch the PfEMP1-variant expressed on the surface of infected red blood cells. Recent investigations also support a role for the non-PfEMP1 VSA proteins in cytoadhesion, antigenic variation and as targets of the human immune response [
30‐
32]. The non-PfEMP1 VSA families are located in close proximity to the
var genes within the hypervariable regions of the
P. falciparum chromosomes. The chromosomal position of the VSA gene families thus complicates their genetic analysis. Because of this position the VSA gene families were excluded from a recent extensive analysis of progenies of experimental
P. falciparum crosses [
10].
The aim of this work was to characterize VSA gene family inheritance in an NF54 clone with whole genome sequencing (WGS) technology. To provide a framework to investigate identity by descent (IBD) in field isolates, a set of 84 microsatellites was evaluated for its ability to distinguish between the 3D7 and non-3D7 parts of the E5 genome. Microsatellites (MS) are variable numbers of tandem repeats in DNA [
33]. They have the advantage that they are locus-specific and highly polymorphic. Because most microsatellites are located in non-coding regions they are not subject to purifying selection. The original work by Walliker, Wellems and Su has generated a large repository of MS primers that were originally used to determine the genetic basis of chloroquine resistance [
34] and erythrocyte invasion in progeny of experimental genetic crosses [
35] as well as multiple other fundamental aspects of
P. falciparum biology (summarized in Figan et al. [
36]). MS flanking drug resistance loci have also been employed to determine the size of genetic sweeps in population based studies [
37,
38]. A 12-locus primer set developed by Anderson et al. [
39] has been used by many investigators to assess the genetic diversity of field isolates [
40,
41]. Recently, Figan et al. [
36] identified 12 MS markers that can reliably differentiate progeny from experimental crosses. However, this small number of MS precludes an analysis of chromosomal inheritance. Therefore, we here evaluate a set of 84 MS markers distributed over the 14
P. falciparum chromosomes to type chromosomal regions as 3D7- or non-3D7.
Genome changes in progeny of a
P. falciparum cross are a consequence of crossover or non-crossover recombination [
10,
42]. Crossover recombination represents a reciprocal exchange between homologous chromosomes during meiosis, whereas non-crossover recombination results in the duplication of a sequence from a donor site that replaces a sequence at an acceptor site (also referred to as gene conversion).
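As an illustration of how the two types of event can be told apart once markers along a chromosome have been typed, the following minimal sketch counts switches between 3D7-type and non-3D7-type calls. The marker calls, the function name `count_crossovers` and the `min_run` threshold are hypothetical and are not taken from this study. Treating short interior blocks as candidate non-crossover events is a simplifying assumption, not the method used here.

```python
from itertools import groupby


def count_crossovers(calls, min_run=2):
    """Count parental switches along one chromosome.

    calls:   marker types ordered by chromosomal position,
             e.g. "3D7" or "non-3D7" (two parental types assumed).
    min_run: assumed minimum number of consecutive markers for a block
             to count as a haplotype block; shorter interior blocks are
             reported as candidate non-crossover events.
    Returns (crossovers, candidate_non_crossovers).
    """
    # Collapse the ordered calls into (type, run length) blocks.
    blocks = [(label, len(list(run))) for label, run in groupby(calls)]

    kept = []
    candidates = 0
    for i, (label, size) in enumerate(blocks):
        interior = 0 < i < len(blocks) - 1
        if interior and size < min_run:
            # Short block flanked by the other parental type.
            candidates += 1
            continue
        if kept and kept[-1] == label:
            # Merge with the preceding block of the same type.
            continue
        kept.append(label)

    crossovers = max(len(kept) - 1, 0)
    return crossovers, candidates


# Hypothetical marker calls for one chromosome, ordered by position.
example = ["3D7", "3D7", "non-3D7", "3D7", "3D7", "non-3D7", "non-3D7"]
print(count_crossovers(example))
```

In this hypothetical example the single non-3D7 marker embedded in a 3D7 block is flagged as a candidate non-crossover event, and the final switch to a run of non-3D7 markers is counted as one crossover. Real data would additionally require marker positions and confirmation by sequence data.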
The analysis of E5 offered the opportunity to investigate crossover and non-crossover recombination in a natural sibling of the 3D7 genome clone. Zero to three crossovers per chromosome were identified. VSA gene families were inherited in their respective parental chromosomal background. The chromosomal distribution of VSA genes in E5 was virtually identical to that in 3D7. The var and rifin/stevor gene families represented the most genetically distinct parts of the E5 genome. However, only one definite non-crossover recombination event among non-3D7 and 3D7 var genes was detected.
Discussion
3D7 and E5 were both cloned from the original NF54 isolate [
3,
4] and thus represent progeny of a natural genetic cross. Although the parents of this cross are not known, a previous analysis of 32 progeny of the 7G8×GB4 experimental cross [
57] has shown that the two parental genomes are inherited on average at a ratio of 1:1 per progeny. Given that approximately 50% of the E5 genome is identical to 3D7, this suggests that 3D7 is isogenic with one parent of this cross. Thus, analysis of E5 allowed an assessment of chromosomal crossovers as well as non-crossover recombination in a progeny clone of a natural genetic cross.
In this work, the E5 genome was characterized with MS genotyping as well as short and long read WGS techniques. All genotyping approaches suggested a chromosomal recombination rate of 0–3 crossovers per chromosome, consistent with previously reported crossover rates in progeny of experimental genetic crosses [
10,
57]. Similarly, all methods indicated that inheritance of VSA gene families occurred within the context of the respective parental haplotypes. A comprehensive analysis of VSA inheritance was, however, only possible with long read Pacific Biosciences WGS, because the read length of > 8000 base pairs enabled an accurate assembly of the highly variable telomeric and central chromosomal parts that harbour the VSA gene families. This analysis showed that the VSA gene families have almost the same number of genes in E5 and 3D7.
Annotation of the E5 genome revealed a total of 5733 genes. This number is slightly higher than the 5500 genes in the 3D7 reference genome and is explained by the fact that the Companion annotation tool overpredicts open reading frames [
52]. Genome-wide comparison by OrthoMCL analysis revealed that > 95% of the genes in the E5 and 3D7 genomes had orthologues in both genomes. Only approximately 4% of the E5 and 3D7 genes were singletons, and the
rifin/stevor and
var genes represented the largest group of genes with known functions among the singletons. Despite this, the total number of identified singleton
var genes was lower in the OrthoMCL analysis than the number of unique E5
var genes identified by direct sequence alignment. The underestimation of
var gene diversity by the OrthoMCL analysis is likely due to highly conserved exon II sequences. Overall, the data are consistent with the previously reported high genetic diversity of VSA gene families compared to the highly conserved
P. falciparum core genome.
The
var gene family has long been shown to be prone to recombination during meiosis [
7,
42,
58‐
60] and mitosis [
9,
61,
62]. Furthermore, several investigations have recently quantified mitotic
var gene recombination rates [
9,
62] in different strains. Analysis of the 3D7 and E5 genomes revealed that E5 had a total of 62
var genes (compared to 61
var genes in the 3D7 reference genome). The “additional” new
var gene was generated by recombination between a 3D7
var gene on chromosome 8 and an E5 specific
var gene on chromosome 14. 3D7 has no full
var gene on chromosome 14, but recently Otto et al. showed that 8 of 10 field isolates carry a
var gene in this subtelomere of chromosome 14 [
11]. This shows that non-chromosomal recombination can expand the
var gene repertoire of individual strains but that the sites of these changes appear to be conserved across different isolates. The presence of an intact “3D7 donor” sequence suggests that the chimeric
var gene is the result of a gene conversion event as it has been reported previously for the
var gene family [
42,
61]. Recently Calhoun et al. [
63] showed that experimentally induced double stranded breaks are repaired by the “telomere healing” pathway. Indeed, their work showed a similar non-crossover recombination event resulting in the replacement of a chromosome 13 telomere by a chromosome 9 telomere, thereby creating a new chimeric
var gene on chromosome 13. The data presented here thus support a role for telomere healing in the generation of VSA gene family genetic diversity. A previously described chimeric
var gene sequence [
7] that carries a 105 bp 3D7 fragment within the DBL of the E5
var gene was reidentified in the current analysis and the corresponding “3D7 donor”
var gene was localized to chromosome 9. This chimeric sequence is located within a hypervariable DBL block that has been shown to exhibit high sequence variability in field isolates [
64]. Larger population based studies with long read WGS are necessary to determine whether this type of short chimeric sequence represents true non-chromosomal recombination or simply random sharing of sequences among the global
var gene population.
The VSA gene families of
P. falciparum are located in subtelomeric regions and internal clusters. The boundaries between the VSA containing areas and the stable core genome have recently been newly defined by Otto et al. [
11], through the analysis of 10 newly cultured field isolates from different geographic regions by long read Pacific Biosciences sequencing technology. The beginning of the subtelomeric region was defined as the point where newly assembled genomes stop aligning with the 3D7 reference genome; however, recombination within the subtelomeric regions could not be assessed because the analysed strains were not genetically related. In contrast, in this work the analysis of the 3D7-type subtelomeric and central areas of the E5 genome with short and long read WGS enabled an assessment of recombination in the VSA harbouring parts of the E5 genome. Analysis of the 3D7-like subtelomeres and internal clusters by short read WGS exhibited moderate SNP frequency and low coverage and thus suggested relatively frequent sequence alterations compared to the 3D7 reference sequence. This likely reflects the difficulty of short read sequencing technology in the characterization of DNA sequences with high AT content and an abundance of repetitive DNA elements. In contrast, long read WGS data of the subtelomeres and central clusters identified only one large scale recombination event, showing that most of the 3D7-type subtelomeric sequences were indeed co-linear with the original 3D7 sequences. Together, these data indicate that the majority of subtelomeres of
P. falciparum are highly conserved across progeny from genetic crosses and that long read sequencing technology is more appropriate for the characterization of the genome areas harbouring VSA gene families.
3D7 and E5 both originate from the same NF54 culture and, therefore, have been in tissue culture for approximately the same time. The highly conserved nature of the E5 genome parts harbouring the 3D7-VSA gene families suggests that mitotic non-chromosomal recombination alone is insufficient to explain the global genetic diversity of the
var gene family [
65]. This suggests that the selective pressure of the host immune system is essential for the expansion of parasite populations with new chimeric
var genes and thus for the generation of the seemingly endless diversity of the global
var gene repertoire. Furthermore, the high degree of genetic diversity in the
rifin/stevor gene families indicates that these non-PfEMP1 VSAs may be under similar diversifying selection as the
var gene family [
29‐
32].
Larger studies of progeny from natural genetic crosses with long read sequencing technology are necessary to examine the possible role of acquired immunity in the generation of var gene and rifin/stevor genetic diversity at the population level.
While there has been a long standing interest in the analysis of VSA families from different laboratory strains, field isolate VSA gene families have recently moved into focus. In this context it has become clear that progeny of natural genetic crosses that show IBD are far more prevalent than previously thought [
66].
In order to establish a method that can reliably differentiate between different progeny of a natural genetic cross, a set of 84 MS primers from the NIH database was evaluated for its ability to identify the 3D7 and non-3D7 parts of the E5 genome. Of these, 27 MS primers resulted in erroneous genotyping with the PCR conditions applied in this work. This is likely due to the fact that one standardized set of PCR conditions was applied for all primers and no attempts were made to optimize individual reaction conditions. However, even with these standard PCR conditions, 54 of 57 MS genotyping results were confirmed by WGS. Three MS loci (ebp, hrp2 and C12M30) showed the same alleles in E5 and 3D7, despite being located in the non-3D7 part of E5. Two of these MS were located within the open reading frames of ebp and hrp2, indicating that these genes are not sufficiently diverse to distinguish between sibling parasites.
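The marker counts reported above can be checked with a few lines of arithmetic. This is only a bookkeeping sketch; the variable names are hypothetical, and the numbers are those stated in the text.

```python
# Marker bookkeeping using the counts stated in the text.
total_primers = 84   # MS primers evaluated
failed = 27          # erroneous genotyping under the standard PCR conditions
confirmed = 54       # genotyping results confirmed by WGS

typed = total_primers - failed       # markers that yielded a usable genotype
uninformative = typed - confirmed    # same allele in E5 and 3D7 despite a non-3D7 region
concordance = confirmed / typed      # fraction of typed markers confirmed by WGS

print(typed, uninformative, f"{concordance:.1%}")
```

Under these counts, 57 markers were typed, three were uninformative, and roughly 95% of the typed markers agreed with WGS.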
Comparative genotyping of E5 and 3D7 with 54 MS was accomplished within a few days, and the use of different fluorophores for different MS on each chromosome enabled “head to head” genotyping of individual E5 and 3D7 chromosomes by multiplex PCR reactions. To our knowledge, this is the first time that MS genotyping has been directly compared to WGS. MS length differences of < 3 bp between E5 and 3D7 correctly identified the 3D7-type parts of the E5 genome. In some of these 3D7-type MS alleles the PCR fragment length differed from the in silico length of the respective MS in the 3D7 genome (version 3). This is most likely due to DNA slippage during PCR amplification. However, given that the PCR fragment lengths of these MS were identical after amplification of E5 and 3D7, this phenomenon appears to be highly reproducible and does not lead to erroneous genotyping.
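The length criterion described above can be sketched as a simple classification rule. The function name `classify_ms_allele`, the marker names and the fragment lengths below are hypothetical and serve only as illustration. The sketch assumes that the E5 fragment is compared with the experimentally amplified 3D7 fragment rather than with the in silico length, since the two can differ because of PCR slippage.

```python
def classify_ms_allele(e5_len_bp, ref_len_bp, tolerance_bp=3):
    """Type one MS locus as "3D7" or "non-3D7".

    e5_len_bp:    PCR fragment length measured for E5 (bp).
    ref_len_bp:   PCR fragment length measured for 3D7 under the same
                  conditions (bp), not the in silico length.
    tolerance_bp: length differences below this value are treated as
                  the same allele (< 3 bp, as stated in the text).
    """
    if abs(e5_len_bp - ref_len_bp) < tolerance_bp:
        return "3D7"
    return "non-3D7"


# Hypothetical fragment lengths (bp): (E5, 3D7) per marker.
fragments = {
    "marker_A": (182, 182),
    "marker_B": (205, 203),
    "marker_C": (150, 171),
}

calls = {name: classify_ms_allele(e5, ref) for name, (e5, ref) in fragments.items()}
print(calls)
```

Comparing against the measured 3D7 fragment keeps the rule robust to reproducible slippage artefacts, because both samples are affected in the same way.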
Recently, Figan et al. [
36] identified a set of 12 different microsatellite markers that reliably distinguish between progeny of 4 different experimental genetic crosses. The PCR conditions employed by Figan et al. and the PCR conditions in this work were almost identical, suggesting that the two primer sets could be combined for rapid genotyping of field isolates.
SNP barcoding has recently emerged as a genome wide typing technique and has been used to investigate
Plasmodium and the origin of its genotypes [
67,
68]. The barcoding genotyping technique, which is based on 23 single nucleotide polymorphisms (SNPs) and on high-quality raw sequence data [
69], detects differences in the organelle genomes of
P. falciparum and thus is not suitable for characterization of chromosomal inheritance. Similarly, another SNP assay, developed some years earlier, is based on 24 SNP loci that are distributed unevenly across the genome, i.e. some chromosomes have no SNP markers and others only one, so that tracking chromosomal crossover events is not possible [
70].
SNP and WGS analyses are expensive and depend on the availability of high quality sequence data as well as extensive bioinformatic expertise. Therefore, SNP and WGS analyses can only be applied to subsets of P. falciparum lines and are usually carried out in specialized centres with extensive resources. In contrast, MS genotyping and data analysis can be carried out in smaller centres, potentially enabling investigator driven analysis and identification of the P. falciparum strains most suitable for subsequent WGS analyses in specialized centres.
The vast majority of the 54 confirmed MS are located in the non-coding parts of the P. falciparum genome. Consequently, they are not under purifying selection and may reflect the underlying genetic plasticity of the P. falciparum genome more accurately than methods that are based on the detection of SNPs in coding regions.
Future analysis of natural P. falciparum cross progeny from semi-immune and non-immune individuals may allow insights into the factors that drive crossover and non-crossover recombination in P. falciparum. In this context, MS genotyping may be used to determine IBD in field isolate progeny and to identify parasite clones most suitable for WGS analysis.