Background
Over 100 anopheline mosquito species can transmit human malaria parasites but there are important differences among these species that influence their role in malaria transmission. Many of these species belong to a sibling complex; a complex is a taxonomic group of morphologically identical, closely related species. In the past, sibling species have been hard to distinguish and complexes have often been treated as a single entity despite important differences among sibling species. In Africa,
Anopheles arabiensis,
Anopheles coluzzii and
Anopheles gambiae from the Gambiae complex and
Anopheles funestus from the Funestus subgroup are undoubtedly the most important vectors transmitting both
Plasmodium falciparum and
Plasmodium vivax parasites to humans [
1‐
3]. Within the Gambiae complex,
Anopheles melas and
Anopheles merus are also considered dominant vectors (“dominant” is defined as a vector species that has been identified as the main, dominant or important vector in at least one region) whereas there is no strong evidence that other species from this complex play any role in malaria transmission [
4,
5].
In addition to differences in the vector status of each species, sibling species also have important differences in their geographical distributions. Previous studies that estimated the geographical distributions of the dominant malaria vectors were hampered by low volumes of data for individual sibling species and had to choose between mapping complexes or incorporating species records that had been determined on the basis of morphology alone and were therefore potentially misidentified [
6]. Furthermore, insecticide resistance in vector species currently threatens the efficacy of vector control [
7], making this a critical factor that needs to be understood within each vector species. In the past, many studies that used susceptibility assays to measure prevalence of resistance in vector populations did not fully identify sibling species. Thus, the mortality values obtained related to the species complex as a whole and potentially important differences among sibling species were not identified.
In recent years, the importance of species identification alongside the availability of accurate molecular identification methods has increased the number of studies reporting reliably identified sibling species. The aim of this study was to use the increasing availability of sibling species records, and an improved species distribution model, to define the geographical distributions of individual vector species within the Gambiae complex and Funestus subgroup in Africa. The available evidence for insecticide resistance was then examined in these species to assess the feasibility of combining insecticide resistance data with the geographical distributions generated. The distributions of An. gambiae and An. coluzzii are modelled separately for the first time and An. funestus is modelled for the first time as the type species distinct from other members of the subgroup.
Methods
Summary of species distribution map generation
Records of sibling species occurrence, where species were identified using molecular methods, were retrieved from the published literature (from both resistance and behavioural studies) and from unpublished sources to compile a set of presence records for each species. A larger dataset, including all Anopheles surveys in the region, was used as a background dataset that captured sampling bias in the presence records. Both datasets informed a species distribution model that identified the combinations of environmental variables that best distinguished areas supporting species presence from the range of environments sampled. This model was then used to estimate the relative probability of species presence at all locations within the species range.
Species background and presence data
Data from two previously collated and publicly available databases of dominant malaria vector species occurrence and bionomics [
8,
9] were combined with a new database of insecticide resistance records (described below) and duplicate records were removed. Searches of the more recent literature were conducted from the dates that the earlier searches finished (2009, 2013 and 2015 respectively) to 29 September 2016, to fill any gaps in the dataset. The new searches used the Web of Sciences bibliography and the search terms “[species name]” and “[country name]”. New records of occurrence that matched the inclusion criteria were extracted and added to the composite database. Only studies that provided the location and time of collection, and gave details of the identification method(s) used, were included.
Geographical coordinates for the collection locations were converted to decimal degrees. For sites where no coordinates were given, coordinates were assigned using the site name and contextual information, such as the district or distance to a major city, using online gazetteers including GeoNames, Google Maps, and OpenStreetMap. All coordinates provided by the source or generated as part of this project were checked to ensure that they matched the sampling design described, fell on land and fell in the correct country, using the geographical information software ArcMAP. If collection dates were missing for a data point, the year of collection was assumed to be two years before the article publication year based on the trend seen for data with a known collection date, for the purposes of this study. For each species, the full species occurrence dataset was classified into (1) studies that used molecular identification methods capable of detecting that species, and (2) studies that would not have detected that species using molecular methods.
To generate a presence dataset for each species, all records that used appropriate molecular identification methods and recorded presence of the species were extracted from the dataset described above. For all studies that identified the species formerly known as An. gambiae (An. gambiae/An. coluzzii or M/S forms combined) using molecular methods but did not identify the M and S forms, occurrences were labelled “An. gambiae (old)” to distinguish them from the newly defined An. gambiae (formerly An. gambiae S form). Records of “An. gambiae (old)” outside the An. coluzzii range plus a 300 km buffer were designated An. gambiae and records inside the overlapping An. gambiae and An. coluzzii ranges, plus buffer, were discarded.
It is not possible to test empirically whether spatial clustering in the presence data is due to habitat suitability or spatial bias in sampling effort, so this was accounted for a priori through the selection of background data with the same spatial bias in sampling effort [
10]. The full mosquito occurrence dataset was used as a source of background data that captures the sampling bias in the environments surveyed for
Anopheles species. Most of the species modelled have ranges that border on desert areas where
Anopheles are known to be absent but no surveys are performed, so 210 pseudo-absences were generated by randomly selecting locations within the desert biome defined by the United Nations Environment Programme [
11].
Summary of available insecticide resistance data for sibling species
A literature search was performed in the Web of Science bibliographic database using the search terms “insecticide resistance” and “anopheles”. Articles of potential interest were identified and their abstracts were scanned to identify studies that had performed a bioassay on field-collected mosquitoes (up to the F1 generation). Full texts were obtained for these articles and data extracted for all bioassays that had used an insecticide from one of the four major neurotoxic classes: carbamates, organochlorines, organophosphates and pyrethroids. Unpublished data were also requested from authors of the published articles and groups working on insecticide resistance. Data fields extracted from both published and unpublished sources covered: species; collection dates; collection location; method(s) of capture; method(s) of identification; insecticide; bioassay protocol; percent mortality; and source citation(s). Data from all bioassays identified were extracted and any deviations from a standard published protocol (for example non-standard exposure times) were noted. Records that did not identify sibling species using molecular methods were discarded for the purposes of the current study. Summary statistics were calculated, based on the mortalities obtained for samples where >95% of the mosquitoes were identified as a single species, to give an indication of bioassay data availability and variation among sibling species.
Species ranges
In order to model the geographical distribution within each species range, previously defined ranges for
An. arabiensis,
An. funestus,
An. melas and
An. merus [
6,
12] were used to limit the extent of the model outputs. These ranges were compared to the presence dataset for each species described above and if confirmed records of the species were found outside the previously defined range, the range was extended to encompass the new location(s). For
An. coluzzii the presence dataset and a previous map showing records of the M form of
An. gambiae [
13] were used to define its range. One record of a single
An. coluzzii mosquito in Zimbabwe shown on the previous map was discarded after a thorough search of the literature found no other record of this species within over 500 km of this location since the original record was published. Individual species ranges and presence points were combined to generate ranges for the Gambiae complex, Funestus subgroup and Funestus group. A 300 km buffer was added to each species range to reflect uncertainty in the exact ranges of these mosquitoes.
Environmental data
The modelling approach used here relies on the relationship between species occurrence and combinations of environmental covariates. Covariate values were extracted from an existing set of spatial data layers for environmental covariates believed to be of importance to mosquito occurrence and malaria transmission [
14]. Full details are provided in Additional file
1. Briefly, data layers at a 5 × 5 km resolution were included for land surface temperature, seasonality in temperature, measures of wetness/greenness, seasonality in wetness/greenness, elevation, proportional cover of 14 land classes, and human population density.
Species distribution model
Each species was modelled separately using the same species distribution model. The approach used was a boosted regression trees method that combines both regression trees (which build a set of decision rules on the predictor variables by portioning the data into successively smaller groups with binary splits) and boosting (which selects sets of trees that minimise the loss function) to best capture the variables that define the distribution of the input data [
15‐
17]. The boosted regression trees methods has been used in previous malaria vector studies [
6] and has recently been updated to use background data that characterises sampling bias in the presence records, and to include changes in land cover over time [
18,
19]. For each model, the presence records for that species and the background data points located within the range of that species, excluding survey records found in the presence dataset, were used. The background data were classified into (1) records from studies that used methods that would have identified the sibling species being modelled (had that species been present in the sample collected), and (2) all other records. Data points linked to a random 10% of locations from both the presence and background datasets were withheld for use in the model validation. Together the remaining presence and background data formed the model training dataset.
The boosted regression trees method requires both presence and absence data, or background data can be used when true absence data is not available. Mosquito occurrence datasets are subject to spatial bias and if unaccounted for this survey bias can translate into environmental bias in the fitted model. The background data used in the study reflected the same survey bias found in the presence data so the model could identify suitable environments for the species within the sampled space, rather than just areas that are more heavily sampled. This approach does not eliminate sampling bias issues entirely but improved model performance has been demonstrated [
10]. The model was updated further in order to weight the background data so that records from surveys using molecular methods that would have identified the species being modelled (had that species been present in the sample collected) received twice the weight of other background data points. Presence and background data from 2001 to 2012 were linked to covariate values for the relevant year in order to improve the predictions where possible. For all records prior to 2001, covariate values for 2001 were used, and for any data collected after 2012, covariate values for 2012 were used. Model predictions were made to the most contemporary covariate data available.
For each species, 200 submodels were then fitted trained to a bootstrap of the presence/background dataset. Each submodel generated a predicted value for the relative probability of species occurrence at every 5 km × 5 km pixel and together the ensemble of submodels generated a distribution of predicted values for every pixel. Mean values together with 0.025 and 0.975 quantile values were then derived from the distribution of predictions at every 5 × 5 km pixel.
Model validation
Withheld data (the test data) from each presence and background dataset were used to validate each mean map generated. The area under the receiver operator curve (AUC) was calculated to assess the mean map’s ability to distinguish species presence points from background points that are representative of the locations surveyed for
Anopheles vectors, whilst marginalising the arbitrary choice of a classification threshold [
20]. An AUC of 0 means the model ranked all sites the wrong way round, 0.5 means the model was no better than random, and an AUC of 1 means it made a perfect prediction. The same test presence and background datasets were used to calculate the AUC value for previously published maps of these species to allow us to compare model performance.
Discussion
This study provides full modelled geographical distributions for
An. coluzzii and
An. gambiae (as now defined) for the first time and clear differences can be seen between these two sibling species, formerly considered a single species. Estimates for the distributions of
An. arabiensis,
An. melas and
An. merus (also within the Gambiae complex) are provided based on improved methods and updated data, resulting in notably better model performance than seen with a previous mapping study [
6]. The geographical distribution of
An. arabiensis has also been modelled in recent years by an independent group [
21]. Their aim was to extrapolate into the future when environmental conditions not currently in existence may occur so they selected a low bias bootstrap aggregation for one class data (LOBAG-OC) model. The data output by their model were not released so a quantitative comparison is not possible but a visual comparison shows broader habitat suitability in the earlier modelled map compared to the current study. The AUC value generated by that study was marginally lower than the value generated here (0.77 compared to 0.78, although caution is needed because the data used to generate these values differed), the data volumes were much lower, and the data for each environmental variable used in the earlier model was a single average over a long time period (1950–2000), meaning the current map is based on a more robust approach.
Although the full geographical distributions of
An. coluzzii and
An. gambiae have not been modelled previously, a recent study modelled the probability of
An. coluzzii presence relative to the probability of
An. gambiae presence in their sympatric range [
22]. It is difficult to compare (1) an analysis of the relative occurrence of two species with (2) two independent species maps, but the results presented here are consistent with the predictions made by the earlier study. Both show
An. coluzzii extending further north and closer to the coast than
An. gambiae within west Africa. An interesting extension of the current work would be to characterize the locations where both species are sympatric while maintaining reproductive isolation.
A modelled distribution for
An. funestus (distinct from the subgroup) has also been generated here for the first time. Only range maps were previously available for this species but the geographical distribution of the Funestus subgroup has been modelled previously [
6]. The geographical distribution for the Funestus subgroup generated by the current study, using the same methods as the current
An. funestus map, showed better model performance than the earlier work.
It is well established that
Anopheles species are strongly influenced by temperature and humidity or wetness [
23‐
25], and previous studies have found differences in the relationships between these variables and individual sibling species [
26‐
32]. The study presented here used a model that provides strong predictive power to generate robust species distributions but it cannot elucidate relationships with individual environmental variables. It was clear, however, that the variables with the strongest influence on each model were wetness and temperature or factors strongly correlated with these two variables, namely elevation and a vegetation index, and the exact set of top predictors varied among each sibling species. Relationships between species and vegetation type have also been found by previous studies [
33,
34] and pollution is known to play a role [
22]. Fourteen land classes were included in this work and some of these were important predictors as was human population density, which is linked to pollution, but caution is needed when interpreting the ranking of covariate influence by the model. The modelling framework used here is not an appropriate tool to confirm the specific relationships with environmental variables found by detailed field studies and the maps also need to be viewed in the context of the scale used. The aim here was to produce continent-wide maps at a 5 × 5 km resolution and the land cover data used by the model were expressed as the proportion of square kilometres within each 5 × 5 km assigned to a particular land cover type. All other environmental and socioeconomic variables were provided to the model at 5 × 5 km resolution. This approach does not capture microscale variation that may be important to these species locally. At the other end of the spectrum, however, for vector and malaria control plans devised at a national or subnational level, the data from the maps presented here can be aggregated to provide information for the areas used for planning.
The temporal resolution of the data used to generate these maps was annual and thus seasonal variation is not captured here and the maps presented provide the relative probability of a species occurring at each location during at least one time of the year. Strong seasonal fluctuations in mosquito abundance occur, particularly in West Africa, and these differ among species [
35‐
38]. If the dataset used here incorporated a systematic bias towards collection times that would miss particular species then this could impact the maps generated, however, the collated data provide good coverage for each species within areas with strong seasonal patterns as well as regions with smaller fluctuations (Additional file
3).
The geographical distribution of a species is not sufficient information alone to inform vector control programmes and these distributions need to be used in combination with data on the key attributes of each species. Use of indoor insecticide-based control measures has resulted in important reductions in vector populations and malaria prevalence [
39,
40] but at the same time the relative abundance of individual malaria vectors has changed [
41‐
44] leading to greater importance placed on mosquitoes that bite or rest outdoors or have less restricted feeding preferences [
8,
45,
46]. Also critically important is resistance to the major insecticide classes and data is essential to provide evidence for insecticide resistance management planning [
47]. Changes in the prevalence of insecticide resistance over time and differences among sibling species are apparent in the dataset presented here but a full analysis of this data must take account of the strong spatial bias in sampled locations and potential confounding factors such as variation in the protocols used. The one exception is susceptibility to organophosphates in
An. funestus, which has remained constant over time at all locations, in agreement with earlier reviews [
48]. It is clear that there are far fewer data available for individual species than for complexes [
49]. There is, however, sufficient data available for the most important vectors species to allow variation in resistance at the species level to be considered and combined with species distributions.
Authors’ contributions
CLM designed the study. AW performed the experiments with support from JL, FS and CLM. KG compiled the insecticide resistance data overview. AW, KG and NCM processed and geopositioned the data with input from CLM, MES and MC. All authors read and approved the final manuscript.