Background
Anopheles gambiae sensu stricto (s.s.) is the most important vector of human malaria in Africa, causing 90% of the fatal cases worldwide [
1]. It is believed that the differentiation of this very synanthropic and anthropophilic species within the
A. gambiae complex is very recent, having taken place a few thousand years ago as a result of the expansion of human activities, which provided mosquitoes with new breeding opportunities and eventually worsened malaria transmission in sub-Saharan Africa [
2].
Chromosomal and molecular evidence from West Africa suggests that
A. gambiae s.s. is currently undergoing incipient speciation leading to a segregation by reproductive isolation of (at least) two "molecular forms" provisionally named M and S [
3‐
6]. These forms have a largely overlapping range west of the Great Rift Valley, although their relative frequencies are very different on a micro-geographic scale, probably due to adaptation to differentiated larval habitats [
7‐
10]. Owing to a common background of shared ancestral polymorphisms and to still ongoing (although limited) gene flow, M and S forms are characterized by an overall very low degree of genetic differentiation, which has been shown to be mostly restricted to three unlinked regions of their genome. Two are adjacent to the centromeres of the 2L and X chromosomes and the third is in a small portion of the 2R chromosome ("genomic islands of speciation" [
11,
12]). Although the overall picture suggests that we are observing speciation at its very early stages, the taxonomic status of
A. gambiae s.s. molecular forms has not yet been established, nor has consensus been reached on whether they should be considered polymorphic components of a single species or emerging species on independent evolutionary trajectories. This issue is of great interest not only from an evolutionary point of view, but also because it has important implications both for malaria epidemiology and for the optimization of vector-based control strategies.
One major constraint on progress toward resolving this debate is the difficulty of finding molecular markers with contrasting evolutionary dynamics, which would allow a better understanding of the strength of the reproductive barrier between molecular forms. In fact, so far, M and S forms are characterized by form-specific single nucleotide polymorphisms (SNPs) in the spacer regions of ribosomal DNA (rDNA) [
13‐
15] and their population genetics has been analysed mostly with microsatellites, which present important intrinsic (e.g. low differentiation between M and S, and homoplasy) and technical (e.g. need for sequencing facilities) drawbacks that have limited their exploitation [
16‐
19].
Recently, the analysis of the insertion patterns of transposable elements (TEs) (i.e. mobile genetic units capable of replicating and spreading in the host genome) has been successfully applied to support genetic differentiation between
A. gambiae molecular forms [
5,
20‐
22]. Among TEs, Short INterspersed Elements (SINEs) have been extensively used as phylogenetic and population genetic markers in primate taxa [
23] and, preliminarily, in
A. gambiae [
5,
20]. SINEs are 100–500 bp long non-autonomous retrotransposons occurring in large copy numbers in eukaryotic genomes [
24‐
27], that need to recruit enzymes encoded by Long INterspersed Elements (LINEs) to mobilize after transcription via RNA polymerase III [
28,
29]. They present unique features absent in most other TEs, which make them particularly useful for phylogenetic and population genetic studies: i) they can be considered 'homoplasy-free characters' because the chance of independent insertions/excisions into/from the same site is remote; therefore, the ancestral state is represented by the absence of the element at a locus and shared insertions at that locus are identical by descent [
23,
30]; ii) since they are short, they can be amplified even from low-quality genomic DNA and insertion polymorphisms at individual genomic locations can be easily and rapidly assayed by PCR [
31]; iii) polymorphic SINEs are believed to be recently inserted and, thus, can help illuminate recent evolutionary events and resolve complexities in the population genetics structure [
30‐
34].
SINE200 is a ~200 bp element that is highly repetitive (>3,000 copies) and widespread in the
A. gambiae s.s. genome [
35]. Here we report the structure of this element and the results of a large-scale analysis aimed at highlighting different patterns of
SINE200 insertion polymorphism between
A. gambiae molecular forms at loci inside the speciation islands and propose the exploitation of these elements as novel molecular markers for the identification and/or population genetic analysis of M and S forms.
Discussion
The analysis of the consensus sequence of
SINE200 indicates that it is a typical tRNA-related SINE element. In fact, it has a tRNA-related region at the 5' end with the A and B boxes found in polymerase III promoters. It also has a variable number of the AAG tandem repeat at the 3' end, which is also typical for tRNA-related SINEs [
42]. The middle of
SINE200 is a conserved sequence that is not related to tRNA sequences, as already described for other eukaryotic
SINE elements [
43].
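As an illustration of how the variable 3' AAG tandem repeat described above could be scored on a sequenced copy of the element, the following minimal Python sketch counts the consecutive AAG units at the end of a sequence. The function name and the toy sequence are hypothetical and introduced only for this example; the toy sequence is not the SINE200 consensus, and this is not the procedure used in the study.

```python
import re


def count_terminal_aag_repeats(sequence):
    """Return the number of consecutive AAG units at the 3' end of a sequence."""
    seq = sequence.upper()
    # (?:AAG)+$ matches one or more AAG units anchored at the end of the string.
    match = re.search(r"(?:AAG)+$", seq)
    return len(match.group(0)) // 3 if match else 0


# Hypothetical toy sequence (NOT the SINE200 consensus): an arbitrary
# 5' portion followed by four AAG units at the 3' end.
toy_element = "GGCTCAGTGGTTAGAGCATCGATTC" + "AAG" * 4
terminal_repeats = count_terminal_aag_repeats(toy_element)
print(terminal_repeats)
```

Anchoring the pattern at the end of the string means that AAG triplets occurring elsewhere in the element are not counted, so only the terminal tandem array is scored.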
Eight
SINE200 loci within
A. gambiae s.s. speciation islands were analysed, as follows: i) two on the X-chromosome, one of which (i.e.
S200 X6.2) was absent in all specimens tested, while the other (i.e.
S200 X6.1) was fixed in the M-form and absent in the S-form samples; ii) one on 2R (i.e.
S200 2R12D), which was found to be polymorphic in both molecular forms; and iii) five on 2L, which were all fixed in both forms. The observed high frequency of fixation of the insertions in centromeric areas probably reflects a common behaviour of transposable elements, which tend to accumulate in regions of reduced recombination [
44], as also suggested for other retrotransposon classes in the
A. gambiae genome [
21].
The observed differences in the allelic frequencies at the
S200 2R12D locus highlight a significant reduction of gene flow between the two molecular forms. This represents additional evidence in support of the relevance of this small chromosomal region in the speciation process ongoing within
A. gambiae s.s., as proposed by Turner
et al. [
11]. Interestingly,
S200 2R12D lies in close proximity (about 20 kb) to an odour receptor gene (i.e. GPR-OR38), which has been suggested to be related to reproductive isolation between molecular forms [
12]. Moreover, a similar level of differentiation was observed within the M-form, suggesting a subdivision between western and western-central M-populations (Figure
1). This sub-structuring observed within the M-form is consistent with recent evidence from a wide microsatellite analysis carried out on the same M-form populations [
45] and with previous observations by Slotman
et al. [
6], who suggest that M populations from Mali and Cameroon may no longer be considered a "single entity". It should be noted, however, that the
S200 2R12D locus lies within the 2Rb chromosomal inversion, which is shared by M and S forms and shows different frequencies in various eco-geographic areas [
4,
5]. It is thus possible that the spread of this element in natural populations is affected by the 2Rb inversion polymorphism, although preliminary data show that the
S200 2R12D insertion is not exclusive to either of the two alternative chromosomal arrangements (i.e. 2R+
b and 2Rb). Further studies on larger karyotyped samples are ongoing to evaluate a possible association between the 2Rb inversion and the element insertion.
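To make concrete how differences in insertion frequency at a single presence/absence locus translate into a measure of differentiation, the sketch below computes the insertion allele frequency from codominant PCR genotype counts and then Wright's F_ST for two equally weighted populations. It is a deliberately simplified illustration: it applies no sample-size correction, it is not necessarily the statistic used in this study, and the function names and the example frequencies are hypothetical.

```python
def insertion_frequency(n_ins_ins, n_ins_abs, n_abs_abs):
    """Insertion allele frequency from counts of the three diploid genotypes."""
    n_individuals = n_ins_ins + n_ins_abs + n_abs_abs
    if n_individuals == 0:
        raise ValueError("at least one genotyped individual is required")
    return (2 * n_ins_ins + n_ins_abs) / (2 * n_individuals)


def insertion_fst(freq_a, freq_b):
    """Wright's F_ST at one biallelic locus for two equally weighted populations."""
    for p in (freq_a, freq_b):
        if not 0.0 <= p <= 1.0:
            raise ValueError("frequencies must lie in [0, 1]")
    p_mean = (freq_a + freq_b) / 2
    # Expected heterozygosity of the pooled population.
    h_total = 2 * p_mean * (1 - p_mean)
    if h_total == 0:
        return 0.0  # locus monomorphic overall: no differentiation measurable
    # Mean expected heterozygosity within the two populations.
    h_within = (2 * freq_a * (1 - freq_a) + 2 * freq_b * (1 - freq_b)) / 2
    return (h_total - h_within) / h_total


# Hypothetical frequencies, chosen only to show the behaviour of the function.
example_fst = insertion_fst(0.8, 0.2)
print(round(example_fst, 2))
```

An insertion fixed in one population and absent in the other gives F_ST = 1, whereas identical frequencies give F_ST = 0; intermediate values indicate partial differentiation.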
As it is recognized that
SINEs do not excise from a genome after their insertion [
23,
30] and since all
SINE200 loci analysed were found to be specific to
A. gambiae s.s., the analysed insertions likely occurred after divergence of this species from the other members of the
A. gambiae complex. Moreover,
S200 X6.1 was found to be exclusive to and highly conserved in the M-form and was, therefore, probably integrated in its genome recently, after divergence of molecular forms within the chromosome-X speciation island. This locus lies in proximity to CYP4G16, a gene of the cytochrome P450 family which has been indicated as a candidate gene in the incipient speciation process ongoing within
A. gambiae s.s. [
11].
In addition to the above cited indications in favour of a possible fruitful exploitation of
SINE200 in the study of the sub-structuring of
A. gambiae, the exclusive presence of
S200 X6.1 in the M-form allows us to propose a novel, straightforward approach to distinguish
A. gambiae s.s. molecular forms. In fact, all methods developed so far for their identification are based on point mutations in the IGS region of rDNA, which is formed by several tandem arrays known to be subject to concerted evolution. Thus, possible diagnostic problems, in particular in the interpretation of hybrid M/S patterns, may arise from incomplete homogenization of the arrays through concerted evolution and/or mixtures of M and S IGS-sequences among the arrays of single chromatids, due to recombination between copies on the X and Y chromosomes [
15]. The
S200 X6.1 locus, on the other hand, although located only about 1 Mb from the IGS region, does not show these constraints, being present in a single copy on the X-chromosome. Moreover, it is important to highlight that the PCR-RFLP [
38,
39] and IMP-PCR [
13,
46] methods currently used for M and S identification are based on the recognition of single/few mutation(s), and are thus subject to homoplasy. On the other hand, the PCR diagnostic approach proposed here is based on the specific and irreversible insertion of a 230 bp element in the M-form (and its absence in the S-form), thus allowing an unambiguous, simple and straightforward recognition of M and S forms (Figure
3). It is also interesting to note that, although the S-form amplicon is identical to those of
A. melas and
A. quadriannulatus, the 26 bp deletion reported for
A. arabiensis suggests that the novel approach could also be used to discriminate
A. gambiae from
A. arabiensis specimens without preliminary species identification in large areas of sub-Saharan Africa where
A. gambiae molecular forms and
A. arabiensis are the only species of the complex present.
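The logic of the proposed diagnostic can be sketched as a simple band-size classifier. The 230 bp insertion and the 26 bp deletion are taken from the text above; everything else is an assumption made for illustration only: the size of the empty-site amplicon (250 bp in the example), the sizing tolerance, the function names, and the reading that the A. arabiensis amplicon is 26 bp shorter than the empty-site (S-form) amplicon. The primers and actual amplicon sizes of the assay are not reproduced here.

```python
INSERT_BP = 230          # SINE200 insertion length reported in the text
ARABIENSIS_DEL_BP = 26   # deletion reported in the text for A. arabiensis


def expected_bands(empty_site_bp):
    """Expected amplicon sizes, given a hypothetical empty-site amplicon size."""
    return {
        "M": empty_site_bp + INSERT_BP,
        "S": empty_site_bp,
        "arabiensis": empty_site_bp - ARABIENSIS_DEL_BP,
    }


def classify(observed_bp, empty_site_bp, tolerance_bp=10):
    """Assign a label to the set of band sizes observed for one specimen."""
    expected = expected_bands(empty_site_bp)
    hits = set()
    for band in observed_bp:
        matches = [name for name, size in expected.items()
                   if abs(band - size) <= tolerance_bp]
        if len(matches) != 1:
            return "unresolved"  # band fits no expected size, or more than one
        hits.add(matches[0])
    labels = {
        frozenset({"M"}): "M-form",
        # The text notes this band is shared with A. melas and A. quadriannulatus.
        frozenset({"S"}): "S-form pattern",
        frozenset({"arabiensis"}): "A. arabiensis",
        frozenset({"M", "S"}): "M/S hybrid pattern",
    }
    return labels.get(frozenset(hits), "unresolved")


# Hypothetical empty-site amplicon of 250 bp, used only for this example.
print(classify([480], 250))
print(classify([480, 250], 250))
```

Because the three expected bands differ by 26 bp or more, any tolerance below 13 bp keeps them distinguishable; a wider tolerance makes the S-form and A. arabiensis bands overlap, in which case the function reports the specimen as unresolved rather than guessing.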
Acknowledgements
We are grateful to all scientists and entomology teams who provided samples utilized in this study; we especially thank K. Adasi, M. Akogbeto, T. Baldet, G. Carrara, C. Costantini, P.J. Cani, C. Curtis, I. Dia, J. Dossou-yovo, N. Elissa, F. Fortes, A. Mendjibe, YT. Touré, W. Takken, S. Torr and G. Vale for help with sample collections. We also thank JMC. Ribeiro for help with bioinformatic analyses of SINE200 and V. Petrarca and J. Pinto for useful discussion on data and manuscript. The project was funded by NIH-grant AI42121 to ZT. EM was supported by Compagnia di San Paolo (Torino, Italy) in the context of the Italian Malaria Network.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
FS and EM carried out the molecular processing, participated in the analysis and interpretation of data, and in the drafting of the manuscript; YQ contributed to the molecular processing; FS collected part of the samples and contributed to the drafting of the manuscript; ZT proposed the study and contributed to the set-up of the experimental approach, data analysis and drafting of the manuscript; AdT conceived and coordinated the study and wrote the manuscript. All authors read and approved the final manuscript.