This study described the development of laboratory assays based on multiplex PCRs followed by NGS, a unique bioinformatic process with the B4Screening pathway, and a novel threshold-calibrated MOI estimation method capable of detecting multiple-strain infections of P. falciparum parasites in artificially mixed laboratory strains and field isolates. Using this pipeline, the 24 barcode SNPs were identified successfully and uniformly from the 12 chromosomes of P. falciparum in a sample. The impact of pre-amplification, parasite concentration and strain proportion on the SNP frequency of mixed laboratory strains was evaluated. Neither parasite concentration within the tested range nor pre-amplification influenced the SNP frequency of strains mixed at the same proportion, which allowed the evaluation of field samples across a range of parasite densities with and without pre-amplification. Both DBS and frozen blood yielded sufficient reads at each SNP location analysed. Based on consistent frequency calls for the 24 barcode SNPs at the targeted locations, algorithmic reconstruction of strains for each sample using the novel, previously published StrainRecon algorithm [17] recovered the barcodes of dominant strains with 98.5% accuracy. In field samples from western Kenya, up to 5 strains per sample were identified using the same tools, and temporal changes in 24 SNP barcode-based MOI could be reliably estimated using the novel threshold-calibrated STIM method developed in this study.
Pipeline development
Multiplex PCR is a fast and cost-saving approach for pathogen diagnosis and genotyping [29, 30]. Three multiplex PCRs covering all 24 barcode SNPs identified by Daniels et al. [19] within the P. falciparum genome were developed in this study. The 24 short target regions needed to be amplified with comparable efficiency in the multiplex PCRs to ensure sufficient SNP coverage at each location via NGS. The 3 optimal multiplex PCRs were developed through a series of optimizations and pilot testing.
The most important and challenging issue was how to uniformly call the 24 barcode SNP frequencies based on NGS data and assign these frequencies to haplotypes. The second, interlinked, challenge was to determine how many strains were represented in each sample. To resolve these challenges, the StrainRecon mathematical algorithm for strain disambiguation [17], applied to different SNPs located on different chromosomes [19], was leveraged in this study. It is important to point out that the consistency of SNP frequency calls at the targeted locations determines the ability of StrainRecon, and thus of STIM, to assign these frequencies to haplotypes, disentangle the multiple strains within a sample and estimate MOI.
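As a minimal sketch of what a per-site SNP frequency call amounts to, the Python snippet below computes the alternate-allele read fraction at each barcode position from read counts. The function name, the input layout and the `min_depth` cut-off are illustrative assumptions and are not taken from the pipeline described here; in this study the frequency calls were produced by the bioinformatics pipeline and then passed to StrainRecon [17].

```python
from typing import Dict, Tuple


def snp_frequencies(
    counts: Dict[str, Tuple[int, int]], min_depth: int = 100
) -> Dict[str, float]:
    """Return the alternate-allele read fraction at each barcode SNP.

    counts maps a SNP label to (reference_reads, alternate_reads).
    min_depth is a hypothetical cut-off: sites with fewer reads are
    skipped because their fraction would be too noisy to be useful.
    """
    freqs: Dict[str, float] = {}
    for snp, (ref_reads, alt_reads) in counts.items():
        depth = ref_reads + alt_reads
        if depth < min_depth:
            continue  # not enough coverage to call a frequency
        freqs[snp] = alt_reads / depth
    return freqs


# Toy read counts (made up for illustration, not data from this study).
toy_counts = {
    "snp01": (900, 100),
    "snp02": (450, 550),
    "snp03": (3, 2),  # below min_depth, so it is dropped
}
toy_freqs = snp_frequencies(toy_counts)
```

A vector of such fractions across the 24 positions is the kind of input a strain reconstruction method needs, which is why uneven coverage or noisy calls at even a few positions can make haplotype assignment unreliable.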
To maximize the consistency of SNP frequency calls from NGS data for haplotype assignment and strain reconstruction, the unique bioinformatics pipeline with the B4Screening pathway developed in this study first removed spurious amplicons introduced by sequencing. Initial data trimming and cleaning with CutAdapt and Prinseq, together with data quality visualization via FaQCs before entry into the Bioconductor pipeline, ensured that only trimmed reads of sufficient quality were processed. Without proper screening of noise, extraneous sequences could erroneously inflate the number of distinct sequences by orders of magnitude over what truly exists in the data. Numerous low-copy-number amplicons that did not reflect true amplicon diversity were removed at the population level using the novel and sensitive RF classifier developed in this study.
While this novel bioinformatics pipeline currently applies to the described 24-SNP malaria barcoding scheme, the data cleaning steps here underline the importance of careful evaluation of NGS output in any non-template-driven system, such as whole-genome multilocus sequence typing (wgMLST) or other barcoding approaches. The majority of unique amplicons (yet a minority of total reads) generated in the laboratory and subsequently removed by routine bioinformatics processing did not reflect intended targets. In addition, chimerism between multiplex PCR targets was substantial in the raw data. Therefore, leaving the primers, which are clearly distinct between sites, on the reads during processing allowed a quick screen for chimeras. Incorporating this step prior to primer trimming also improves data quality and efficiency of analysis.
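The primer-based chimera screen can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in this study: each read is assumed to span the full amplicon, primers are matched exactly (a real screen would tolerate mismatches), and the primer sequences and target names are invented for the example.

```python
from typing import Dict, Optional, Tuple


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def classify_read(
    read: str, primers: Dict[str, Tuple[str, str]]
) -> Tuple[str, Optional[str]]:
    """Label a read as 'ok', 'chimera' or 'unassigned'.

    primers maps a target name to (forward_primer, reverse_primer).
    A read is 'ok' when the primer at its start and the primer at its
    end belong to the same target, and a 'chimera' when they belong to
    different targets.
    """
    starts = [t for t, (fwd, _) in primers.items() if read.startswith(fwd)]
    ends = [t for t, (_, rev) in primers.items() if read.endswith(revcomp(rev))]
    if len(starts) != 1 or len(ends) != 1:
        return ("unassigned", None)
    if starts[0] == ends[0]:
        return ("ok", starts[0])
    return ("chimera", None)


# Hypothetical primer pairs for two targets (not the study's primers).
toy_primers = {
    "targetA": ("AAACCC", "GGGTTA"),
    "targetB": ("TTTGGG", "CACACA"),
}
good_read = "AAACCC" + "ACGT" + revcomp("GGGTTA")      # A start, A end
chimeric_read = "AAACCC" + "ACGT" + revcomp("CACACA")  # A start, B end
noise_read = "GGGGGGGGGGGGGGGG"                          # no primer match

labels = [classify_read(r, toy_primers)
          for r in (good_read, chimeric_read, noise_read)]
```

Because the check only needs the two ends of each read, it has to run before primer trimming; once the primers are removed, the information that distinguishes targets at the read ends is gone.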
In the bioinformatics pipeline, only information regarding the known specific SNP sites (binary) was incorporated, and the finding of multiple distinct point mutations per amplicon was not utilized. This latter information (categorical) can be incorporated in future work, not only to enhance the 24 SNP frequency calls but also to support the analysis of amplicon deep sequencing data.
Although laboratory or clinical samples can have high parasite densities, field samples from population-based community surveys are collected largely from asymptomatic individuals, who tend to have low parasite densities. In addition, in high and medium transmission areas, strains present at a minor proportion in a sample are often undetectable using conventional molecular technologies [31]. Both low parasite density and low strain proportion could make it harder to produce consistent frequencies across all SNP locations because of limited DNA template availability. In this study, the impact of pre-amplification, parasite concentration and strain proportion on the consistency of the 24 SNP frequency calls was evaluated. The results showed that SNP frequencies were not influenced by parasite concentration within the currently tested range or by pre-amplification, but differed by strain proportion. It is reasonable to expect that SNP frequency calls reflect only the original DNA template diversity when noise and spurious amplicons from NGS are minimized using the unique bioinformatics pipeline developed in this study. Most importantly, the results have practical applications. First, they allow evaluation of field samples with a range of parasite densities, with and without pre-amplification. Second, pre-amplification allows evaluation of samples whose parasite concentration is sufficient for amplification without this step but insufficient for analysis of diversity. Critically, NGS is sensitive for the detection of low-frequency SNPs in a sample [32, 33]. Compared with existing conventional molecular tools, with which minor parasite proportions below 10–30% in a sample were generally undetectable [9, 34–36], the tools developed in this study detected the barcodes of the dominant strain with 98.5% accuracy and strains at proportions of 2 to 10% in a sample with accuracy between 77.5 and 89.3%. Although the target lower limit for minor strain detection was set at 0.5 parasite/µl (see the second subsection on laboratory assay development in Methods), this study did not attain such a fine level of resolution because of the inaccuracy observed at lower strain proportions (0.5–2%): as the strain proportion decreases, the ability to detect the minor strain accurately diminishes. This lower bound is influenced by both the background noise level in the current analysis pipeline and the low DNA copy number of minor strains in the sample, which decreases the expected precision. Nevertheless, the results from this study showed a substantial improvement in sensitivity for detecting minor parasite populations in a sample, particularly those above 5% proportion, with acceptable precision. While deep sequencing of individual targeted antigenic genes also provides the ability to detect gene-specific minor variants in a sample, estimates from such highly selected genes might not represent true genomic signatures of parasites [14, 15, 37] and may offer limited temporal and geographic discrimination between parasite populations [15].
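The loss of precision at low strain proportions can be illustrated with a simple binomial sampling argument. The sketch below is an illustration under strong assumptions and is not the error model used in this study: it treats every sampled molecule (template copy or read, whichever is limiting) as an independent draw and ignores background noise and amplification bias. The function name and the example values are hypothetical.

```python
import math


def relative_sampling_error(proportion: float, n_molecules: int) -> float:
    """Coefficient of variation of an observed strain proportion.

    Under binomial sampling, the observed proportion has standard
    deviation sqrt(p * (1 - p) / n); dividing by p gives the relative
    error sqrt((1 - p) / (p * n)).
    """
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must be strictly between 0 and 1")
    if n_molecules <= 0:
        raise ValueError("n_molecules must be positive")
    return math.sqrt((1.0 - proportion) / (proportion * n_molecules))


# Relative error for a few minor-strain proportions at a fixed,
# arbitrarily chosen number of sampled molecules.
n = 1000
errors = {p: relative_sampling_error(p, n) for p in (0.10, 0.05, 0.02, 0.005)}
```

For a fixed number of sampled molecules the relative error grows roughly as one over the square root of the proportion, so a strain at 0.5% is estimated far less precisely than one at 5% even before pipeline noise is considered.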
The STIM algorithm for assessing MOI based on SNP read frequencies relies crucially on noise levels. Highly variable input from a noisy pipeline may cause MOI to be overestimated, because noise is confused with true signal during reconstruction. A threshold was therefore placed on MOI misfit and calibrated by balancing the false negative rate of StrainRecon strain reconstruction on the known laboratory samples against the false positive rate and the number of strains estimated. The same threshold was then used for running STIM on field data, under the assumption that most noise would come from the pipeline steps for which the threshold already accounted. Importantly, the trends and conclusions from the field continue to hold even if the true threshold were slightly shifted relative to the one determined in the laboratory strain setting. In other words, the MOI value estimated by STIM is subject to noise, whereas temporal changes in MOI as estimated by STIM are resilient to such noise. A large study across geographic regions is ongoing to examine the robustness of the STIM method in the field, as well as the potential need for further calibration of the proposed threshold with a richer set of artificial strain mixtures.
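The idea of calibrating a misfit threshold on samples of known composition can be sketched as below. This is a hypothetical simplification and not the STIM procedure itself: it assumes that a misfit value is available for each candidate number of strains, takes the estimated MOI to be the smallest strain count whose misfit falls at or below the threshold, and scores each candidate threshold by the number of under- and over-estimated samples. All names and toy values are invented for the illustration.

```python
from typing import List, Sequence, Tuple


def estimate_moi(misfits: Sequence[float], threshold: float) -> int:
    """Smallest strain count whose misfit is at or below the threshold.

    misfits[k - 1] is the reconstruction misfit when k strains are
    fitted. If no count passes, the largest count tested is returned.
    """
    for k, misfit in enumerate(misfits, start=1):
        if misfit <= threshold:
            return k
    return len(misfits)


def calibrate_threshold(
    samples: List[Tuple[Sequence[float], int]], candidates: Sequence[float]
) -> float:
    """Pick the candidate threshold with the fewest mis-estimated samples.

    samples pairs each misfit profile with the known number of strains.
    Underestimates (missed strains) and overestimates (noise read as
    strains) are weighted equally here, which is an assumption.
    """
    best, best_cost = candidates[0], float("inf")
    for t in candidates:
        under = sum(estimate_moi(m, t) < known for m, known in samples)
        over = sum(estimate_moi(m, t) > known for m, known in samples)
        if under + over < best_cost:
            best, best_cost = t, under + over
    return best


# Toy misfit profiles for mixtures with 2, 1 and 3 known strains.
toy_samples = [
    ([0.30, 0.04, 0.03], 2),
    ([0.05, 0.02, 0.01], 1),
    ([0.40, 0.20, 0.04], 3),
]
chosen = calibrate_threshold(toy_samples, [0.01, 0.05, 0.10, 0.25])
```

A threshold that is too strict pushes the estimate towards more strains (noise is fitted as signal), while one that is too lenient merges true minor strains into fewer, which is the trade-off the calibration on laboratory mixtures is meant to balance.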
Field results
MOI has only recently been used as a metric for malaria transmission. Therefore, the EIR and malaria prevalence in children 5 years old and younger from the same study area were obtained for the period from 1995–1996 to 2012 (Fig. 5a, b; data extracted from both published [38] and unpublished KEMRI/CDC data) for side-by-side comparison with the MOI estimated in the current study (Fig. 5c, d). The results show that the EIR declined sharply between 1995 and 2001 and remained low, while malaria prevalence decreased gradually between 1996 and 2007 and then reached a plateau between 2007 and 2012. In comparison, the average MOI declined gradually over time, and the percentage of samples with 5 strains dropped from 57% in 1996 to 18% in 2012 (during which period the proportion of one-strain samples increased). Since there was no correlation between MOI and age, the decline in MOI over time is unlikely to be confounded by host age [20]. Overall, the decreases in both average MOI and the proportion of samples with 5 strains over time are in tandem with the declines in EIR and malaria prevalence, but the turning points differ. Specifically, MOI shows a slow reduction, EIR shows a sharp decline, and malaria prevalence stagnates from 2007 to 2012. This suggests a non-linear scaling relationship among the three malaria metrics [3]. The reasons behind the slow reduction in MOI are unclear; the large number of distinct strains detected in the area may play a role (878 distinct strains at a proportion of at least 5% were detected in total and are reported in Results). Nevertheless, MOI, which provides information on the number of strains within a host, is a higher-resolution parasite index than malaria prevalence and might represent the true transmission level. A further study of the parasite strain population size and strain relatedness using this dataset is needed.
The tools developed in this study advance the estimation of both the number of strains within a host and the number of strains at the population level, enhancing the resolution of MOI estimation. This advantage is particularly obvious compared with the original TaqMan PCR 24 SNP barcode assay and COIL analysis for complexity of infection (COI) estimation, in which monomorphic or polymorphic genotypes within each sample are estimated [19–21]. Taken together, the combined approach established in this study could be used for MOI estimation, particularly for temporal changes in MOI in regions with medium to high transmission levels. A large-scale validation study is being conducted using samples from different malaria-endemic countries/regions with heterogeneous transmission intensity.