The application of whole genome sequencing (WGS) is gaining traction across malaria-endemic countries. With the resulting development of Plasmodium parasite genomic databases (“big data”), there is an opportunity to implement machine learning methods to inform disease control. One application of WGS data is the detection of genomic signatures of selective sweeps resulting from the spread of mutations associated with antimalarial drug resistance. This work presents a supervised (deep) learning approach (DeepSweep) which, after being trained on haplotypic “images” of established drug resistance genes in P. falciparum and P. vivax parasites, identified loci known to be under recent positive selection. Whilst the strength of sweep signals per locus found by DeepSweep correlated with established EHH methods (e.g. between-population Rsb), the machine learning approach has the advantages of not requiring a rigid definition and calculation of population-genetic statistics, of incorporating information within and across populations, and of having relatively lower requirements for the pre-processing of raw SNP data. Like other machine learning approaches, it has the potential to scale up to large numbers of samples and is parallelizable across genomic regions, thereby making it a potentially useful “big data” tool. In the absence of sufficient computational power, it is possible to develop sampling strategies that select the subset of data and samples containing the highest density of information relevant to DeepSweep. Different model structures were assessed, but performance could be improved further by fine-tuning model hyperparameters (e.g. the number and size of the convolutional filters).
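To make the idea of a sampling strategy concrete, the following is a minimal sketch in Python, not the DeepSweep implementation. It assumes that haplotypes are coded as a 0/1 matrix (rows are samples, columns are SNPs) and uses minor allele frequency as one possible proxy for "information density"; the function names, the image dimensions and the row ordering are all hypothetical choices made for illustration.

```python
import random

def minor_allele_freq(column):
    """Minor allele frequency of one biallelic SNP coded as 0/1."""
    freq = sum(column) / len(column)
    return min(freq, 1.0 - freq)

def haplo_image(haplotypes, n_snps=64, n_samples=32, seed=0):
    """Reduce a 0/1 haplotype matrix (rows = samples, columns = SNPs)
    to a fixed-size "image". Hypothetical helper for illustration only."""
    n_rows, n_cols = len(haplotypes), len(haplotypes[0])

    # Rank SNPs by minor allele frequency (assumed proxy for information),
    # keep the most polymorphic ones, then restore genomic order.
    mafs = [minor_allele_freq([row[j] for row in haplotypes])
            for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: mafs[j], reverse=True)
    keep_cols = sorted(ranked[:n_snps])

    # Random subsample of haplotypes without replacement.
    rng = random.Random(seed)
    keep_rows = rng.sample(range(n_rows), min(n_samples, n_rows))

    image = [[haplotypes[i][j] for j in keep_cols] for i in keep_rows]

    # Order rows by count of 1-alleles so similar haplotypes sit together.
    image.sort(key=sum)
    return image

# Toy usage on random data (100 haplotypes, 200 SNPs).
toy_rng = random.Random(1)
toy = [[toy_rng.randint(0, 1) for _ in range(200)] for _ in range(100)]
img = haplo_image(toy)
print(len(img), len(img[0]))  # 32 64
```

Other selection criteria (for example, SNP density per window or sample quality) could be substituted for minor allele frequency without changing the overall structure.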
DeepSweep detected a set of loci not detected by the EHH methods, potentially because a deep learning approach can holistically incorporate information from the raw SNP data that would otherwise be fragmented across separate populations and genomic windows when calculating population-genetic statistics. Indeed, the simulation study demonstrated the potential of including haplo-images from not only single but multiple populations, allowing the algorithm to take advantage of features that are common across regions and to be robust to different stages of the sweeps. However, DeepSweep does require “representative” positive training examples and, in the context applied, assumes that the training drug resistance-related loci have undergone or are undergoing selective sweeps in some of the populations. This assumption is not unrealistic given that some antimalarial drugs have been rolled out in different populations at different times, resulting in differential stages of selective sweeps [40]. The DeepSweep and EHH approaches, as well as alternative methods (e.g. HaploPS [45]), can be considered complementary and could be run in parallel. However, as these approaches will increasingly use WGS, there are general challenges that affect variant calling and ascertainment (e.g. extreme genome GC content), which can affect the density and accuracy of genomic variant inputs, as well as the final population genomic analysis. Typically, WGS analysis leads to a dense set of well-supported variants in robust genomic regions, with calling algorithms incorporating information on known high-quality polymorphisms [6]. Further, highly variable or problematic regions, such as var genes in P. falciparum, are typically removed from analysis [46]. In general, DeepSweep appeared to perform well across different GC content settings (P. falciparum 19%, P. vivax 58%), as well as in a simulated data setting that did not impose any constraint on GC content. Nevertheless, it is important to evaluate the quality of the genomic variants used in any analysis. A further consideration is that most approaches use haplotype data, which in the human context require phasing from genotypes. Whilst the Plasmodium life cycle involves haploid asexual stages, complex clinical infections can complicate and confound population genetic analyses, and therefore analysis was restricted to infections with a dominant clone. However, it may be possible to extend DeepSweep to process individual parasite sequences for samples with multiplicity of infection. Regardless, any novel loci identified should be confirmed through functional work [47]. Further, complementary methods that look at isolate relatedness, as determined by identity by descent (e.g. IsoRelate [48]), could also be implemented. New loci detected by DeepSweep that were not identified by other methods (e.g. on chromosomes 6, 8 and 14 for P. falciparum and on chromosomes 6, 7 and 14 for P. vivax) provide interesting candidates for confirmation studies.
A potential future opportunity is to apply models across species, for example, to detect P. falciparum loci after training on P. vivax signatures, and vice versa. Such an application could assist detection in species where drug resistance loci are unknown or less well established, such as P. vivax. However, the impacts of differences in sample size and degree of polymorphism between species need to be considered. Relatedly, “real data” were used for training, but an alternative may be to use coalescent or forward-in-time simulation to create positive and negative labelled exemplars. However, there is a risk that simulated images might not be representative of actual selective sweeps in nature. The deep learning algorithm has applications beyond positive selection, including for other evolutionary signatures (e.g. balancing selection) or application to other organisms (e.g. mosquitoes and humans).
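As an illustration of how forward-in-time simulation could generate labelled exemplars, the sketch below implements a toy haploid Wright-Fisher model in Python. It is not the simulation used in this study: the population size, window length, selection coefficient, recombination rate and the rule for discarding runs in which the beneficial allele is lost are all assumptions chosen to keep the example small, and the function names are hypothetical.

```python
import random

def simulate_window(n_hap=60, n_sites=41, generations=40, s=0.0,
                    recomb=0.01, seed=0):
    """Toy haploid Wright-Fisher simulation of one genomic window.

    Returns a list of 0/1 haplotypes. If s > 0, a beneficial allele is
    introduced at the central site on a single haplotype.
    """
    rng = random.Random(seed)
    focal = n_sites // 2
    # Random standing variation at every site except the focal one.
    pop = [[rng.randint(0, 1) for _ in range(n_sites)] for _ in range(n_hap)]
    for hap in pop:
        hap[focal] = 0
    if s > 0:
        pop[0][focal] = 1  # single-origin beneficial mutation

    for _ in range(generations):
        # Carriers of the focal allele have relative fitness 1 + s.
        weights = [1.0 + s * hap[focal] for hap in pop]
        new_pop = []
        for _ in range(n_hap):
            mother, father = rng.choices(pop, weights=weights, k=2)
            if rng.random() < recomb:
                cut = rng.randrange(1, n_sites)  # single crossover point
                child = mother[:cut] + father[cut:]
            else:
                child = mother[:]
            new_pop.append(child)
        pop = new_pop
    return pop

def focal_frequency(pop):
    """Frequency of the 1-allele at the central site."""
    focal = len(pop[0]) // 2
    return sum(hap[focal] for hap in pop) / len(pop)

def labelled_exemplars(n_each=5, s=0.5, min_freq=0.5, seed=0, max_tries=1000):
    """Return (haplotypes, label) pairs: label 1 = sweep, 0 = neutral."""
    positives = []
    for attempt in range(max_tries):
        if len(positives) == n_each:
            break
        pop = simulate_window(s=s, seed=seed + attempt)
        # Discard runs in which the beneficial allele was lost or stayed rare.
        if focal_frequency(pop) >= min_freq:
            positives.append((pop, 1))
    negatives = [(simulate_window(s=0.0, seed=seed + max_tries + k), 0)
                 for k in range(n_each)]
    return positives + negatives

data = labelled_exemplars()
print(len(data), [label for _, label in data])
```

A dedicated simulator with a realistic demographic model and recombination map would be needed in practice. The toy model ignores these, which is exactly the kind of mismatch that could make simulated images unrepresentative of sweeps in nature.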