Applications of parasite genetics to spatial epidemiology of malaria
Molecular tools may be most valuable when epidemiological information is scarce and/or mobility data are unavailable. Genomic surveillance and phylogenetic analyses that relate the geographic distribution of genetic signals within and between populations have enabled near real-time estimation of transmission chains for non-sexually recombining, rapidly evolving pathogens (e.g., Ebola, influenza) [
16,
17]. This nascent field of pathogen phylogeography has provided key insights into the routes of pathogen introductions and spread, particularly for viral diseases. However, directly extending these methods to a pathogen such as
Plasmodium falciparum—a sexually recombining eukaryotic parasite with a complex lifecycle—requires both molecular and analytic advancements that are still at the early stages of development. In particular, the malaria parasite
P.
falciparum undergoes obligate sexual recombination and is often characterized by multi-genotype infections and low-density chronic blood-stage infections that can last for months in asymptomatic individuals. More complex still are the many challenges associated with the second most abundant cause of malaria,
Plasmodium vivax [
18]. Unlike
P.
falciparum parasites,
P. vivax parasites can survive for months or years as dormant hypnozoites in the liver, where they are undetectable, and can relapse and cause blood-stage infection at any time. Since genetically diverse hypnozoites can build up in the liver, relapses lead to an even greater abundance of multi-genotype blood-stage infections and thus more frequent recombination between genetically diverse parasites. Moreover, in regions of ongoing transmission, relapses cannot be definitively distinguished from reinfections due to new mosquito bites, further complicating efforts to spatially track
P.
vivax infection. These complexities mean that standard population genetic or phylogenetic approaches do not effectively resolve relationships between malaria parasite lineages [
19]. Therefore, new tools are needed for the effective molecular surveillance of both parasite species.
Most national control programs are interested in spatial scales that are operationally relevant, namely within a given country or between countries if they are connected by migration. Population differentiation on international and continental geographic scales can be identified using principal component analysis, phylogenetic analysis, and the fixation index (
FST) [
20‐
24], yet these methods are not powered to detect finer-scale differentiation. This is because (1) recombination violates the assumptions underpinning classic phylogenetic analyses [
25], and (2) principal component analysis based on a pairwise distance matrix and
FST are influenced by drivers of genetic variation that act on a long time scale (i.e., the coalescent time of parasites) such that if migration occurs multiple times during this time frame, there will be little or no signal of differentiation among populations [
26,
27]. In contrast, methods that exploit the signal left by recombination (rather than treating it as a nuisance factor) may have the power to detect geographic differentiation on spatial scales relevant for malaria control programs.
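To make the behavior of the fixation index concrete, the following is a minimal sketch of a single-locus FST computed as (H_T - H_S) / H_T, where H_S is the mean expected heterozygosity within populations and H_T is the expected heterozygosity of the pooled population. This heterozygosity-based form is one common textbook formulation and is not necessarily the estimator used in the cited studies; the function names and allele frequencies are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def expected_heterozygosity(freqs: Sequence[float]) -> float:
    """Expected heterozygosity at one locus: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in freqs)


def fst_two_populations(freqs_a: Sequence[float], freqs_b: Sequence[float]) -> float:
    """Single-locus F_ST = (H_T - H_S) / H_T for two equally weighted populations.

    Assumption: both populations are given equal weight and allele
    frequencies are known without sampling error.
    """
    # Mean within-population heterozygosity (H_S).
    h_s = (expected_heterozygosity(freqs_a) + expected_heterozygosity(freqs_b)) / 2.0
    # Heterozygosity of the pooled population (H_T).
    pooled = [(a + b) / 2.0 for a, b in zip(freqs_a, freqs_b)]
    h_t = expected_heterozygosity(pooled)
    return 0.0 if h_t == 0 else (h_t - h_s) / h_t


# Hypothetical allele frequencies at one biallelic locus.
distant = fst_two_populations([0.9, 0.1], [0.1, 0.9])      # strongly differentiated
nearby = fst_two_populations([0.55, 0.45], [0.45, 0.55])   # weakly differentiated
print(f"distant populations: F_ST = {distant:.2f}")
print(f"nearby populations:  F_ST = {nearby:.2f}")
```

With these illustrative numbers, populations whose allele frequencies have been homogenized by repeated migration yield an FST close to zero, which is the loss of signal at fine spatial scales described above.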
Recombination occurs in the mosquito midgut when gametes (derived from gametocytes) come together to form a zygote. If the gametes are genetically distinct, recombination will lead to the production of different, but highly related, sporozoites (and thus onward infections). These highly related parasites would tend to have genomes with a high degree of identity. Perhaps the simplest measure of this genetic similarity is “identity by state” (IBS), which is defined as the proportion of identical sites between two genomes and is a simple correlate of genetic relatedness between parasites. However, IBS makes no distinction between sites that are identical by chance and those that are identical due to recent shared ancestry, making it sensitive to the allele frequency spectrum of the particular population under study. Analyses that are probabilistic (e.g., STRUCTURE [
28]) provide better resolution, but ultimately linkage disequilibrium-based methods, such as identity by descent (IBD) inferred under a hidden Markov model [
29,
30] and chromosome painting [
31], provide greater power. These IBD methods harness the patterns of genetic linkage disequilibrium that are broken down by recombination and are therefore sensitive to recent migration events and useful at smaller geographic scales. Additionally, they take advantage of the signals present in long contiguous blocks of genomic identity, which can be detected given a sufficient density of informative markers. The exact density required is a topic of current research and depends on the level of relatedness, required precision, and the nature of the genetic markers in question (e.g., the number and frequency of possible alleles for each marker).
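The definition of IBS given above, and its sensitivity to allele frequencies, can be illustrated with a short sketch. It assumes monoclonal (haploid) samples genotyped at biallelic SNPs coded 0/1, with `None` marking a missing call; the function names, genotypes, and frequencies are hypothetical. The second function gives the IBS expected purely by chance between two unrelated parasites, assuming independent sites with known allele frequencies.

```python
from typing import Optional, Sequence


def identity_by_state(
    genome_a: Sequence[Optional[int]], genome_b: Sequence[Optional[int]]
) -> float:
    """Proportion of sites, genotyped in both samples, that carry the same allele."""
    if len(genome_a) != len(genome_b):
        raise ValueError("genotypes must cover the same sites")
    # Skip sites with a missing call in either sample.
    compared = [
        (a, b) for a, b in zip(genome_a, genome_b) if a is not None and b is not None
    ]
    if not compared:
        raise ValueError("no sites genotyped in both samples")
    return sum(a == b for a, b in compared) / len(compared)


def expected_ibs_unrelated(allele_freqs: Sequence[float]) -> float:
    """Chance IBS for unrelated parasites: mean over sites of p^2 + (1 - p)^2."""
    return sum(p * p + (1 - p) * (1 - p) for p in allele_freqs) / len(allele_freqs)


# Hypothetical genotypes at six SNPs (None = missing call).
sample_1 = [0, 1, 1, 0, None, 1]
sample_2 = [0, 1, 0, 0, 1, 1]
observed = identity_by_state(sample_1, sample_2)

# The same observed IBS means different things in different populations.
chance_balanced = expected_ibs_unrelated([0.5, 0.5, 0.5])
chance_skewed = expected_ibs_unrelated([0.95, 0.95, 0.95])
print(observed, chance_balanced, chance_skewed)
```

In this illustration an observed IBS of 0.8 is well above the chance expectation when alleles are at intermediate frequency, but below it when most sites are nearly fixed, which is why IBS values are hard to compare across populations with different allele frequency spectra.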
In low transmission settings, such as Senegal and Panama, STRUCTURE, as well as IBS (which approximates IBD, albeit with bias and more noise), can often be used to cluster cases and infer transmission patterns within countries [
32‐
34]. In intermediate transmission settings, such as coastal regions of Kenya and border regions of Thailand, where genetic diversity is higher, IBS, IBD, and relatedness based on chromosome painting have been shown to recover genetic structure over populations of parasites on local spatial scales [
27,
35]. However, due to dependence on allele frequency spectra, IBS is not as easily comparable across datasets and, as mentioned above, can be overwhelmed by noise due to identity by chance. Moreover, all of these methods currently have limited support for polyclonal samples. In high transmission settings, the complexity of infection is very high, making it difficult to calculate genetic relatedness between parasites within polyclonal infections or to estimate allele frequencies across polyclonal infections since the complexity entangles the signal from the genetic markers belonging to the individual clones, the number of which is unknown. Methods to disentangle (i.e., phase) parasite genetic data within polyclonal infections are being developed [
36], while THE REAL McCOIL [
37] has been developed to simultaneously infer allele frequencies and complexity of infection, allowing downstream calculation of
FST. However, to fully characterize genetic structure at fine scales in high transmission settings, new methods that estimate IBD and other relatedness measures are needed to infer ancestry between polyclonal infections. Indeed, across all spatiotemporal scales and transmission intensities, we propose that rather than being defined by the transmission of discrete (clonal) parasite lineages, malaria epidemiology may be best characterized as the transmission of infection states, often composed of an ensemble of parasites. Subsets of these ensembles are often transmitted together by a mosquito to another person, and therefore, the combination of alleles/parasites present in an infection state provides rich information about its origin(s) beyond the composition of individual parasites.
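As a simple illustration of why the number of clones in a polyclonal infection is hard to pin down, the sketch below computes a naive lower bound on the complexity of infection: because blood-stage parasites are haploid, each clone contributes one allele per locus, so the largest number of distinct alleles seen at any single locus cannot exceed the number of clones. This heuristic is an assumption introduced here for illustration. It ignores genotyping error and undetected minority clones and does not reproduce the statistical model of THE REAL McCOIL; the function name and genotype calls are hypothetical.

```python
from typing import Iterable, Set


def coi_lower_bound(alleles_per_locus: Iterable[Set[str]]) -> int:
    """Naive lower bound on complexity of infection.

    Returns the maximum number of distinct alleles detected at any one
    locus, or 0 if no loci were genotyped.
    """
    return max((len(set(alleles)) for alleles in alleles_per_locus), default=0)


# Hypothetical calls for one infection at three biallelic SNPs.
biallelic_calls = [{"A"}, {"A", "G"}, {"T"}]
# Hypothetical calls for the same infection at two multiallelic haplotype loci.
multiallelic_calls = [{"h1", "h4", "h7"}, {"h2"}]

print(coi_lower_bound(biallelic_calls))     # biallelic SNPs can never show more than 2
print(coi_lower_bound(multiallelic_calls))  # multiallelic loci can reveal 3 or more
```

The bound from biallelic markers saturates at two regardless of how many clones are present, which is one reason model-based approaches are used to estimate complexity of infection and allele frequencies jointly.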
Current sampling and sequencing strategies for genomic epidemiology of malaria
The use of the genetic approaches described above will depend on the routine generation of parasite genetic data since any molecular surveillance system will improve with more data and must be tailored to the sampling framework and sequencing approach. To date, many studies attempting to obtain epidemiologic information from genomic data have taken advantage of existing samples rather than having sampling tailored to the questions and public health interventions of interest. This is understandable given that a number of these studies have been exploratory and that informed decisions regarding sampling require a priori empiric data on parasite population structure (unavailable in most places) and a predetermined analysis plan (difficult when analytical approaches are actively in development). A more direct and tailored study design should be possible as more parasite genomic data become available and analytical methods mature. However, in general, a greater sampling of infections will be required to answer fine-scale questions regarding transmission (e.g., whether infections are local versus imported, determining the length of transmission chains) than for larger-scale questions such as relative connectivity of parasite populations between distinct geographic regions. Now that sequencing can be performed from blood spots collected on filter papers or even rapid diagnostic tests, sampling passively detected symptomatic cases at health facilities offers the most efficient means of collecting large numbers of infections, often with high parasite densities that make them easier to genotype. Nevertheless, while this may be sufficient to characterize the underlying parasite population in some settings and for some questions, in others, the capture of asymptomatic cases through active case detection may be essential to understand transmission epidemiology, e.g., to determine the contribution of the asymptomatic reservoir in sustaining local transmission.
The discriminatory power of the genotyping method will depend on the local epidemiology and transmission setting. The two most common genotyping approaches, namely relatively small SNP barcodes and panels of microsatellite markers [
38], have been extensively used to monitor the changes in the diversity and structure of the parasite population. However, signals in these markers may not be sufficient to distinguish geographic origin and have limited resolution in certain transmission settings [
37,
39,
40]. Increasing the number of loci and/or discrimination of each locus may be necessary to answer the questions relevant to elimination. Further, increasing discrimination by using multiallelic loci has particular advantages since these may provide more information content than biallelic loci [
41]. This is particularly true in polyclonal infections, which are frequent even in areas close to elimination, because heterozygous genotypes of biallelic loci contain little information (all possible alleles are present), whereas detecting, for example, 3 out of 20 potential alleles in an infection still allows informative comparisons between infecting strains. In addition, some genotypable multiallelic loci are extremely diverse, and relatively small numbers of them can be combined to create high-resolution genotypes. Targeting specific regions of the genome for sequencing after amplification by PCR (amplicon sequencing) or other methods, such as molecular inversion probes [
42], offers efficient approaches to genotyping multiallelic short-range haplotypes, SNPs, and/or microsatellites, providing a flexible platform for deeper and more consistent coverage of regions of interest at lower cost than whole genome sequencing. Amplicon sequencing may be of particular interest for genotyping minor strains in polyclonal infections and/or low-density samples, whereas molecular inversion probes may excel for more highly multiplexed marker assays where capturing low-density samples is not critical. Identifying a panel of optimally informative genetic markers to address a specific question remains a major challenge that must balance the cost, throughput, and discriminatory power. For example, at fine geographic scales, larger numbers of more closely spaced markers with representative coverage of the genome may be required in contrast to studies comparing distant parasite populations; the density at which infected individuals are sampled and the underlying diversity and genetic structure will also affect the number and type of loci required.
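The contrast between biallelic and multiallelic loci in polyclonal infections can be sketched as follows. The example compares the sets of alleles detected at one locus in two infections using the Jaccard index (shared alleles divided by all alleles seen in either infection). The Jaccard index is an illustrative choice made here, not a method taken from the cited work, and the allele labels and function name are hypothetical.

```python
from math import comb
from typing import Iterable


def shared_allele_fraction(alleles_a: Iterable[str], alleles_b: Iterable[str]) -> float:
    """Jaccard index of the allele sets detected at one locus in two infections."""
    a, b = set(alleles_a), set(alleles_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Biallelic locus: every mixed (heterozygous) infection shows both alleles,
# so all such infections look identical whether or not they are related.
biallelic_mixed = shared_allele_fraction({"A", "G"}, {"A", "G"})

# Multiallelic locus with 20 possible alleles, 3 detected per infection.
infection_1 = {"h1", "h4", "h7"}
infection_2 = {"h1", "h4", "h9"}     # shares two alleles with infection_1
infection_3 = {"h2", "h11", "h15"}   # shares none
related = shared_allele_fraction(infection_1, infection_2)
unrelated = shared_allele_fraction(infection_1, infection_3)

# Number of distinguishable allele sets when 3 of 20 alleles are detected,
# versus the single possible set for a mixed call at a biallelic locus.
distinct_sets = comb(20, 3)
print(biallelic_mixed, related, unrelated, distinct_sets)
```

Under these assumptions, a mixed call at a biallelic locus always produces the same observation, whereas three alleles out of twenty can form over a thousand distinct combinations, so overlap between infections becomes informative.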
With proper consideration, a parsimonious set of genetic targets may be identified as useful to answer a number of general questions regarding malaria genomics. Nonetheless, the development of a marker toolbox and genotyping methods tailored to answering questions relevant for transmission at different spatial scales is an important goal. To this end, several ambitious sequencing studies have begun, and over 4000
P.
falciparum genomes have been sequenced from different transmission settings around the globe (such as the Pf3K Project,
https://www.malariagen.net/data/pf3k-pilot-data-release-3) [
40,
43,
44]. These genetic data are all publicly available, providing a crucial framework to build upon when designing more local, sequence-based epidemiological studies that balance the trade-off between the number of genetic loci evaluated and the quality of the data (e.g., depth of sequence coverage) for each parasite sample. Genomic sequencing methods are evolving rapidly towards high-throughput and low-cost, deep sequencing approaches that can be performed on routinely collected patient samples, allowing for evaluation of even asymptomatic low-density infections, e.g., by selective enrichment of parasite DNA [
45,
46]. These enrichment methods can exacerbate the non-uniformity of sequencing coverage across the parasite genome and can require specialized filters to remove erroneous heterozygous calls, yet they generally produce genotypes exhibiting very high concordance with those from samples sequenced via alternate means [
46,
47]. Preferential amplification of dominant strains in a polyclonal infection (i.e., missing minority clones) and the inability to detect copy number variation have also been described as potential limitations of these selective enrichment methods [
47]. Despite these limitations, these methods are enabling cost-effective whole genome sequencing of routinely collected blood samples. Moving forward, we must ensure that rich metadata are made easily available in the context of genome sequences, so that links can be made to experimental, epidemiological, and ecological variables and models.