Background
While genetic factors play a crucial role in the emergence of drug resistance within
Plasmodium falciparum, many aspects of the genetic epidemiology of the parasite remain obscure [
1,
2]. The beginnings of a global perspective on the genetic structure of parasite populations emerged from the analysis of whole-genome sequencing (WGS) data derived from ~200 parasite genomes collected directly from clinical patients in six countries on three continents [
3]. This study gave further evidence for the widespread presence of within-isolate strain mixture and significant amounts of variation in its degree across continents. In grappling with the complexity of WGS read count data, the study departed from standard approaches for quantifying the amount of within-sample variation by instead using an inbreeding coefficient,
\(f_{ws}\), a form of F-statistic.
Strain mixture has been traditionally assessed via multiplicity of infection (MOI) [
4‐
6], using methods for inferring the number of strains from single-nucleotide polymorphisms (SNPs) or other typing technologies applied at a small number of loci. Researchers have subsequently shown how finite mixture models can infer MOI using WGS, but under the heading of complexity of infection (COI), as these models can capture additional mixture features [
7,
8]. Still, inbreeding coefficients have a long connection to population genetics and conservation biology and may be of interest to researchers connecting
P. falciparum studies to other genetic contexts [
9,
10]. This paper presents a collection of statistical methods for estimating
\(f_{ws}\), explores their performance in simulation, details their connection to COI estimates, and confirms the variation in
\(f_{ws}\) values across countries using the
P. falciparum 3000 genomes (PF3K), a publicly available data resource.
Inbreeding coefficients and the F-statistics from which they derive are measurements of the departure of allelic heterozygosity observed within a population from that expected at Hardy–Weinberg equilibrium (HWE) [
10,
11]. HWE specifies the distribution of alleles assuming panmixia, a population exhibiting perfectly random mating with an absence of mutation, migration, drift, selection or other effects [
12]. F-statistics calibrate the empirical allele distribution within a subpopulation against that expected under HWE, ranging from a value of one (no mixture) to zero (perfect HWE-type mixture). In the context of comparing the parasites’ genetic diversity within a single infected individual relative to the local geographic population (and absent any geographic structuring of the population, i.e. the Wahlund effect), these statistics effectively become inbreeding coefficients.
\(f_{ws}\) denotes the relative amount of inbreeding within an individual sample (
w) relative to the expected amount in a subpopulation (
s). Since estimates here are considered only relative to a single country (subpopulation), the paired-subscript notation
\(f_{ws}\) is dropped in favour of
\(f_i\) for a specific sample
i.
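As an illustration of the definition above, the following minimal Python sketch computes an F-statistic as one minus the ratio of observed to expected heterozygosity at a single biallelic SNP. The function names and numerical values are hypothetical and chosen only for illustration; they are not taken from the software accompanying this work.

```python
def heterozygosity(p):
    """Heterozygosity 2p(1-p) at a biallelic locus with allele frequency p."""
    return 2.0 * p * (1.0 - p)


def f_statistic(observed_het, expected_het):
    """F = 1 - H_obs / H_exp; undefined when the expected heterozygosity is zero."""
    if expected_het <= 0:
        raise ValueError("expected heterozygosity must be positive")
    return 1.0 - observed_het / expected_het


# Hypothetical single-SNP illustration (all numbers are made up).
pop_freq = 0.5  # assumed allele frequency in the local population

# A sample carrying only one allele: no within-sample mixture.
clonal = f_statistic(heterozygosity(0.0), heterozygosity(pop_freq))

# A sample whose within-sample allele frequency matches the population.
mixed = f_statistic(heterozygosity(0.5), heterozygosity(pop_freq))
```

In this toy case the unmixed sample yields a value of one and the sample mirroring the population frequency yields zero, matching the range described above.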
F-statistics have proven to be an effective and extremely popular means for investigating species’ population structure from both allelic and genomic data [
10,
13,
14]. However, standard software tools assume specific ploidy structures incommensurate with WGS data from
P. falciparum and so cannot be used directly. The critical difference is that, within a human host,
P. falciparum exists only in the haploid stage of its life-cycle [
15]. Since short-read WGS data cannot yet capture full haplotypes, individual reads cannot be uniquely identified with their strain of origin. Without being able to associate reads to individual
P. falciparum strains, no ‘out-of-the-box’ use of standard
F-statistics approaches with this new data appears possible.
Several earlier works have applied the F-statistic framework to
P. falciparum within-sample mixture. These concepts—while not under the heading of inbreeding coefficients—undergird much of the seminal work on MOI estimation [
5,
6]. More recently, Manske et al. [
3] provided an initial estimator for inbreeding coefficients using WGS based on the slope of a modified regression line between the expected and observed heterozygosity within a sample. Auburn et al. [
16] explore the connection between this estimator and standard MOI approaches by comparing these estimates with MOI values inferred by genotyping the
msp-1 and
msp-2 genes, showing strong correlation between these values in their sample sets.
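The idea of a regression-based estimator can be sketched in a few lines of Python. This section does not specify the modifications applied to the regression line in [3], so the sketch assumes a plain least-squares line through the origin relating within-sample heterozygosity to population heterozygosity across SNPs, and takes the estimate to be one minus the slope. The function names, argument names, and read counts are hypothetical; this is a simplified stand-in, not the procedure of [3].

```python
def slope_through_origin(x, y):
    """Least-squares slope of y on x for a line constrained through the origin."""
    sxx = sum(xi * xi for xi in x)
    if sxx == 0:
        raise ValueError("predictor has no variation away from zero")
    return sum(xi * yi for xi, yi in zip(x, y)) / sxx


def f_from_regression(ref_counts, alt_counts, pop_alt_freqs):
    """Estimate an inbreeding coefficient as 1 - slope (assumed form, see lead-in).

    ref_counts, alt_counts: per-SNP read counts within one sample.
    pop_alt_freqs: per-SNP alternate allele frequencies in the population.
    """
    h_within, h_pop = [], []
    for ref, alt, p in zip(ref_counts, alt_counts, pop_alt_freqs):
        depth = ref + alt
        if depth == 0:
            continue  # no reads at this SNP, so it carries no information
        p_w = alt / depth  # within-sample alternate allele fraction
        h_within.append(2.0 * p_w * (1.0 - p_w))
        h_pop.append(2.0 * p * (1.0 - p))
    return 1.0 - slope_through_origin(h_pop, h_within)


# Hypothetical read counts at three SNPs (made-up numbers).
pop = [0.2, 0.5, 0.1]
f_clonal = f_from_regression([10, 0, 20], [0, 15, 0], pop)  # one allele per SNP
f_mixed = f_from_regression([8, 5, 9], [2, 5, 1], pop)      # mirrors population
```

Under these assumptions, a sample showing a single allele at every SNP gives an estimate of one, while a sample whose read fractions mirror the population frequencies gives an estimate near zero.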
This estimator has been further utilized in a number of recent studies on
P. falciparum, including analyses of populations in the Gambia, Ghana, and Guinea [
17,
18]. It has also been used in analysis of the population structures of
Plasmodium vivax and
Plasmodium knowlesi [
19,
20]. A recent examination of this estimator in the context of microsatellite genotyping explores a strong relationship between the number of variants, allele frequency, and estimator performance [
21]. There has otherwise been little statistical work characterizing this estimator or its properties. This paper seeks to remedy some of this deficit by providing: a simple presentation of this estimator; a set of alternate estimators that make stronger connections to the tradition around
F-statistics; an investigation of their properties through simulations; and several applications to relevant data sets.
This paper proceeds as follows. First, an overview of the data and the notation is provided. The initial estimator employed by Manske et al. [
3] for estimating
\(f_i\) is then reviewed, followed by the presentation of two additional frequentist estimators. A Bayesian approach for these statistics is then derived from the Balding–Nichols model. All of these estimators are compared in an extensive simulation study. To consider their empirical performance, the correlation across all estimators in 344 Ghanaian samples is examined and the Bayesian estimates are compared to COI estimates. To show the performance under controlled circumstances, the methods are applied to several clonal laboratory strains. As a final example, the estimates for the PF3K sample set are presented for each country, confirming significant variation in the amount of within-sample mixture across countries. The conclusion provides a brief discussion of the strengths and limitations of the approaches, and possible future directions for modelling within-sample mixture using WGS.
Discussion
This work presents several new approaches to inferring inbreeding coefficients using read counts from WGS, including a frequentist estimator that is significantly simpler and more intuitive than the initial estimator as well as a Bayesian approach that derives from a classical population genetics model. These approaches help connect MOI investigations to a broader set of work within population genetics and conservation ecology that may be helpful in control efforts [
31]. This work also demonstrates a strong correlation between these metrics and the results of more complex mixture models for inferring COI [
7,
8]. While not intended to supplant these more involved methods for investigating the within-sample mixture, this additional tool can assist researchers in connecting
P. falciparum population genetics to a larger literature. To assist other researchers, the implementation of these methods is also provided as an R package, pfmix, with tutorials and example datasets in an open-source framework at the package site, along with the direct estimates for the PF3K data set [
30].
The model underlying the inbreeding coefficient makes a number of assumptions about the structure of the read count data and the biological mixing process that may affect inference. For the read count data, read counts are assumed to be unbiased and the SNPs are unlinked. While short read data can be biased in several ways, previous research indicates that mixture proportions calculated by read count ratios are largely unbiased (for instance, see [
3] supporting information). However,
P. falciparum exhibits substantial linkage disequilibrium on scales considerably larger than the average distance between neighbouring SNPs in the data. This violation is not expected to bias the estimates, as this absence of independence occurs (roughly) evenly across the genome. However, inference from a small region of the genome will likely exhibit bias.
A perhaps more troublesome assumption is embedded in the underlying structure of the F-statistic. An F-statistic measures the departure of the observed number of heterozygotes from the number expected under Hardy–Weinberg equilibrium. In the context of mixed P. falciparum infections, the equilibrium assumptions (random mating, no selection, large population size, genetic isolation) are likely each violated at some level. For example, the mixture within a sample may be the result of a small number of founding individuals or be strongly selected by the human immune system. Without a more general approach to understanding the mixing process, anticipating the robustness of these estimates to this sort of misspecification is difficult. However, the PF3K samples from Cambodia, which possess quite significant population structure, still exhibit strong correlation between \(f_{ws}\) and the inferred number of strains.
As genomic data enables more elaborate statistical models for mixed infections and a broader understanding of
P. falciparum genetic epidemiology, it will still be useful for field researchers to connect their work with population genetics and ecology through simple metrics. These issues are also relevant for researchers in a number of other
Plasmodium species and protozoa with similar life-cycles. Inbreeding coefficients, which have a history going back to the beginnings of modern genetics, connect to a number of population genetic quantities such as effective population size and genetic drift [
9,
32,
33] and may serve to complement traditional MOI values and newer models to this end. This work meets this need by providing a basis to infer these quantities and a suite of open-source tools for researchers to calculate them.