Background
Despite historical and ongoing eradication efforts, human malaria transmitted by
Anopheles mosquitoes continues to be a major public health concern around the world [
1]. Malaria was one of the leading causes of death during the construction of the Panamanian Interoceanic Canal in the early 1900s. In Panama, malaria prevalence oscillated dramatically during the last 50 years, with sporadic and/or cyclical epidemics every five to 10 years [
2‐
4]. Recently however, from 2001 to 2005, a malaria outbreak was documented in indigenous territories known as “
Comarcas” where a sixfold increase in the number of cases was observed [
5,
6]. This epidemic was controlled during subsequent years, and the number of symptomatic cases in the country has dropped considerably since this event. Nonetheless, malaria is still endemic in Panama, and there is potential for future outbreaks particularly in indigenous
Comarcas with health, social and demographic disparities [
3,
6].
Malaria control in Panama is done mainly through eradication of mosquito vectors using toxic insecticides. This strategy requires that the
Anopheles species responsible for transmission be promptly and accurately identified. Nonetheless, identification in Panama is problematic due to a high number of morphologically similar
Anopheles species [
7]. Control strategies that target Anopheles vectors could be ineffective if they are directed at a misidentified non-vector species. This is likely the case in Panamanian indigenous
Comarcas where as many as 10
Anopheles species occur in a single locality, and where 40% of these are expected to transmit both
Plasmodium vivax and
P. falciparum to humans [
5,
8]. The identification of
Anopheles mosquitoes in Panama is done using traditional morphological approaches (e.g., dichotomous keys), but this approach requires meticulous taxonomic training and a great deal of entomological expertise [
9,
10]. Also, it is time consuming and could be impractical when inspecting numerous samples [
11]. Hence, Panamanian health authorities require new approaches that accurately and rapidly separate vector from non-vector
Anopheles species.
DNA barcoding is a valid alternative to identify arthropod species because it has better taxonomic resolution than morphological approaches, even if the promise of being less expensive has not yet materialized [
12]. DNA barcodes work well with a small amount of tissue, and do not require prior knowledge of insect morphology [
13]. However, generating DNA barcodes requires advanced sample preparation and proper laboratory facilities to extract, amplify and sequence nucleic acids, most of which are rarely found in developing countries where arthropod-borne infections like malaria prevail [
14].
In recent years, matrix-assisted laser desorption/ionization (MALDI) mass spectrometry has become an alternative for arthropod taxonomic identification [
10,
15,
16]. This method has been used effectively to study several aspects of vector biology, including taxonomic status (i.e., species boundaries), pathogen infection rates and food source identity [
17‐
21]. MALDI mass spectrometry uses a profile of the most abundant proteins to “fingerprint” biological samples, and thus, is conceptually similar to DNA barcoding, but possibly cheaper on a per sample basis. Furthermore, MALDI can generate accurate identifications in just a few hours, rather than 5 to 10 days as in the case of DNA barcoding even in “rush” cases [
22]. Previous efforts with MALDI to taxonomically classify members of family Culicidae were successful using both laboratory-reared and field-collected specimens, and specific body parts (e.g., thorax, cephalothorax and/or legs) plus samples from different regions of the world [
9,
10,
23‐
25]. Yssouf et al. [
9,
10] used MALDI with all six legs to classify mosquito species from Africa, Europe and the US while Mewara et al. [
24], using the same approach, accurately identified specimens of four different mosquito genera in Northern India. More recently in France, Vega-Rúa et al. [
25] designed a double entry query protocol with MALDI protein spectra obtained from thoraxes and legs to improve the identification of morphologically compromised specimens. Hence, MALDI’s accurate and rapid identification capabilities might prove ideal to solve the shortcomings of taxonomically classifying
Anopheles mosquitoes in Panama, thus assisting ongoing malaria eradication efforts by improving the vector surveillance system in indigenous
Comarcas.
Here, a methodology based on previously published extraction protocols was adjusted, and the accuracy of MALDI identification was assessed using a small portion of tissue from the mosquito body so that the rest of the specimen could be preserved as a voucher. Different statistical procedures were also explored to analyse and classify protein spectra from field-collected mosquitoes, which are difficult to evaluate with the strategies currently available from commercial vendors, including working without a reference library of well-curated protein spectra. Specifically, the authors ask whether MALDI mass spectrometry can discriminate among field-collected individuals of 11 known mosquito species in the genus Anopheles, including taxa that are vectors of human Plasmodium in Panama, plus Chagasia bathana, a phylogenetically close and ancestral relative of Anopheles.
Methods
Sample preparation and optimization
Initial experiments were conducted with laboratory-reared mosquitoes from three discrete biological species:
Anopheles albimanus (vector of malaria);
Aedes aegypti (vector of Zika and dengue); and
Aedes albopictus (vector of Chikungunya) (Additional file
1). Three different sample preparation protocols (i.e., protein extraction methods), adapted with minor modifications from previous studies [
10,
20], were compared, and the one that gave the most suitable results was selected for further experimentation with field-collected specimens (see the full description of these protocols in Table
1). To test for differences in the mass spectra produced with the three extraction protocols, whole bodies of freshly emerged and starved female mosquitoes were used to avoid noise in the acquired protein signal. In total, 225 female mosquitoes were used at this stage: 25 individuals per species for each protocol. Further, different body parts of female mosquitoes (e.g., head, thorax, abdomen, wings and one each of the anterior, middle and posterior legs) were assessed to confirm whether they produced different protein spectra, and whether these spectra were consistent across specimens of the same taxon, as has been shown previously [
9,
10,
23‐
25]. Body parts were dissected using a micro-dissecting kit, placed in separate micro-centrifuge tubes and labeled accordingly. For this evaluation, another 25 lab-reared female individuals of
A. albimanus,
A. aegypti and
A. albopictus (Additional files
2 and
3) were used. Finally, the section of the body with the highest and most consistent protein signal was selected and used to test whether females and males of a given species differ in their protein spectra, as shown by previous studies using whole insect bodies [
20]. Once more, differences between females and males were evaluated using 25 laboratory-reared specimens of
A. albimanus,
A. aegypti and
A. albopictus, respectively (Additional file
4).
Table 1
Description of three different MALDI mass spectrometry protein-extraction protocols used in the present study
Protocol 1 | Protocol 2 | Protocol 3 |
Selected mosquito body parts were placed in separate microcentrifuge tubes, rinsed with 300 μL ultrapure water and 900 μL ethanol, and centrifuged at 13,000 rpm for 2 min. Samples were decanted and treated with 10 μL of 70% formic acid for 5 min at room temperature. Immediately after, samples were homogenized in the tube with a manual pestle in an additional 10 μL of 100% acetonitrile and centrifuged at 13,000 rpm for 2 min. A small volume of supernatant was pre-mixed with an equal volume of 10 mg/mL α-cyano-4-hydroxycinnamic acid (HCCA) matrix, and 1 μL of the mix was quickly placed in its respective target well in triplicate. | Selected mosquito body parts were rinsed with distilled water and dried with paper. Samples were immediately homogenized with a manual pestle in 20 μL of 70% formic acid and 20 μL of 100% acetonitrile and incubated for 1 h. Samples were vortexed for 15 s and centrifuged at 13,000 rpm for 2 min, and a small volume of the supernatant was pre-mixed with an equal volume of 10 mg/mL HCCA before adding 1 μL of the mix to the target well in triplicate. | Selected mosquito body parts were homogenized with a manual pestle in 20 μL of 10% formic acid, pre-mixed with 1.5 × volume of sinapinic acid matrix, and centrifuged at 13,000 rpm for 2 min. Then 1 μL of supernatant was immediately added to its respective target well in triplicate. |
Field-collected Neotropical Anopheles species
For the second part of the study, fresh
Anopheles mosquitoes from four subgenera and seven geographically spread localities in indigenous
Comarcas across Panama (Table
2) were collected. Mosquitoes were collected at night during seven consecutive days per location, using different types of traps (e.g., Human Landing Catch, Intersection, Shannon and Centers for Disease Control [CDC] miniature light traps) (Additional file
5). Samples were stored at room temperature in individual, dry microtubes along with silica gel, and transported back to the laboratory in plastic bags. Once in the laboratory, mosquitoes were maintained at −20 °C to preserve the integrity of their proteins. Initially, all field-collected specimens were sorted and identified to species level using a taxonomic key based on morphological characters of the female [
26]. Then, between ten and 66 individuals per species were processed and analysed using mass spectrometry, for a total of 12 species and 299 specimens (Table
2). For this section of the study, and upon analysing the outcomes of experiments performed during the first part of the methodology, the best extraction protocol and the section of the mosquito body and sex with the highest protein signal and consistency were used. The goal here was to determine if different Neotropical
Anopheles species, non-vectors and vectors of human
Plasmodium, had specific protein profiles generated with MALDI that could be used for rapid and accurate identification purposes.
Table 2
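Description of samples subjected to analysis with the MALDI mass spectrometry procedure
Species (subgenus) | Specimens | Collection localities | Spots measured | Spectra obtained | Success rate (%) |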
Anopheles (Nys) albimanus | 51 | a–g | 153 | 119 | 78 |
Anopheles (An) apicimacula | 40 | b, d, g | 120 | 110 | 92 |
Anopheles (Nys) aquasalis | 19 | c, d | 57 | 56 | 98 |
Anopheles (Nys) darlingi | 14 | b, g | 42 | 40 | 95 |
Anopheles (An) malefactor | 13 | b, d, g | 39 | 39 | 100 |
Anopheles (Nys) nuneztovari | 66 | b, g | 198 | 192 | 97 |
Anopheles (An) pseudopunctipennis | 15 | b, g | 45 | 45 | 100 |
Anopheles (An) punctimacula | 32 | b, d, g | 96 | 81 | 84 |
Anopheles (Nys) strodei | 16 | e | 48 | 48 | 100 |
Anopheles (Nys) triannulatus | 9 | a, f | 27 | 26 | 96 |
Anopheles (Ker) neivai | 10 | c, f | 30 | 24 | 80 |
Chagasia bathana | 15 | f | 45 | 45 | 100 |
Total | 300 | 7 | 900 | 825 | 92 |
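The percentages in the last column of Table 2 can be reproduced from the counts in the table. The short sketch below assumes that the three numeric columns are, in order, specimens, spots measured (three per specimen) and spectra obtained; the function and variable names are illustrative and are not taken from the study.

```python
# Counts transcribed from Table 2: (specimens, spots measured, spectra obtained).
# The column meanings are an assumption inferred from the text.
TABLE_2 = {
    "An. albimanus": (51, 153, 119),
    "An. apicimacula": (40, 120, 110),
    "An. aquasalis": (19, 57, 56),
    "An. darlingi": (14, 42, 40),
    "An. malefactor": (13, 39, 39),
    "An. nuneztovari": (66, 198, 192),
    "An. pseudopunctipennis": (15, 45, 45),
    "An. punctimacula": (32, 96, 81),
    "An. strodei": (16, 48, 48),
    "An. triannulatus": (9, 27, 26),
    "An. neivai": (10, 30, 24),
    "Chagasia bathana": (15, 45, 45),
}


def success_rate(obtained, measured):
    """Percentage of measured spots that yielded a usable spectrum."""
    return 100.0 * obtained / measured


def summarize(table):
    """Return total specimens, spots, spectra and the overall success rate."""
    specimens = sum(row[0] for row in table.values())
    spots = sum(row[1] for row in table.values())
    spectra = sum(row[2] for row in table.values())
    return specimens, spots, spectra, success_rate(spectra, spots)


if __name__ == "__main__":
    for species, (_, spots, spectra) in TABLE_2.items():
        print(f"{species:25s} {success_rate(spectra, spots):5.1f}%")
    print(summarize(TABLE_2))
```

Rounding each computed rate to the nearest integer gives the values listed in the table.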
MALDI mass spectrometry parameters
The mass spectrometer used for the measurements was an UltrafleXtreme III (Bruker Daltonics, Bremen, Germany) equipped with a MALDI source, a time-of-flight (TOF) mass analyzer, and a 2 kHz Smartbeam™-II neodymium-doped yttrium aluminum garnet (Nd:YAG) solid-state laser (λ = 355 nm) used in positive polarization mode. All spectra were acquired with an automated script in the range of 2000 to 20,000 m/z in linear mode for the detection of the most abundant proteins. Every spectrum represents the accumulation of 5000 shots with 300 shots taken at a time, and the acquisition was done in random-walk mode with a laser power in the range of 50 to 100% (laser attenuation at 20%). To promote the accuracy of the identification algorithms, a spectrum collected with the automated script had to include at least one peak with a minimum intensity of 3500 arbitrary units (a.u.), a stringent quality criterion, to be considered a “good quality” spectrum. The software FlexAnalysis™ (Bruker) was used to analyse the spectra initially, to evaluate the number of peaks and peak intensity, and to perform simple spectra comparisons, visually inspecting for differences in dominant peaks that would suggest possible classification into discrete taxa. All samples were placed and measured on three individual target wells with spectra from three technical replicates collected per well.
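The quality criterion described above can be expressed as a simple filter. The following is a minimal sketch, assuming that a spectrum is available as two equal-length lists of m/z values and intensities; the function name and data layout are hypothetical and do not reproduce the acquisition script used on the instrument.

```python
MZ_MIN, MZ_MAX = 2000.0, 20000.0   # acquisition window stated in the text (m/z)
MIN_PEAK_INTENSITY = 3500.0        # quality threshold stated in the text (a.u.)


def is_good_quality(mz, intensity, threshold=MIN_PEAK_INTENSITY):
    """Return True if at least one point inside the m/z window reaches the threshold.

    Hypothetical helper: the data layout (two parallel lists) is an assumption.
    """
    if len(mz) != len(intensity):
        raise ValueError("mz and intensity must have the same length")
    return any(
        MZ_MIN <= m <= MZ_MAX and i >= threshold
        for m, i in zip(mz, intensity)
    )


if __name__ == "__main__":
    print(is_good_quality([2500.0, 5000.0], [100.0, 4000.0]))
    print(is_good_quality([2500.0, 5000.0], [100.0, 3499.0]))
```

A spectrum that fails this check would simply be excluded before any clustering or classification step.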
Data analysis, statistics and clustering algorithms
For routine mass spectra statistical analysis, including two-dimensional (2D) peak distributions and principal component analysis (PCA), the program ClinProTools™ (Bruker) was used. Individual sample spectra were pre-processed using smoothing and baseline subtraction functions, and three-dimensional (3D) plots were generated to display unsupervised clustering at the subgenus and species levels based on the most abundant protein spectra. However, complete classification of spectra from the field-collected mosquitoes could not be achieved with the manufacturer’s software because reference library entries that conformed to the quality standards of the application could not be created.
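The text does not state which smoothing and baseline algorithms the vendor software applies. As an illustration only, the sketch below uses a centered moving average for smoothing and a rolling minimum as the baseline estimate; both choices, the window sizes and the function names are assumptions and not the functions built into the manufacturer's program.

```python
def moving_average(values, window=5):
    """Smooth with a centered moving average; the window shrinks at the edges."""
    half = window // 2
    out = []
    for k in range(len(values)):
        lo, hi = max(0, k - half), min(len(values), k + half + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def rolling_min_baseline(values, window=51):
    """Estimate the baseline at each point as the minimum in a centered window."""
    half = window // 2
    return [
        min(values[max(0, k - half): k + half + 1])
        for k in range(len(values))
    ]


def preprocess(values, smooth_window=5, baseline_window=51):
    """Smooth the intensities, then subtract the estimated baseline."""
    smoothed = moving_average(values, smooth_window)
    baseline = rolling_min_baseline(smoothed, baseline_window)
    return [s - b for s, b in zip(smoothed, baseline)]


if __name__ == "__main__":
    # Flat offset of 10 a.u. with a single peak in the middle.
    spectrum = [10.0] * 20 + [110.0] + [10.0] * 20
    corrected = preprocess(spectrum, smooth_window=3, baseline_window=11)
    print(max(corrected), corrected[0])
```

Because the baseline is a minimum taken over a window that contains the point itself, the corrected intensities are never negative.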
For more stringent and comprehensive data clustering and identification, a custom-made Linear Discriminant Analysis (LDA) quantitative approach was implemented using the software MATLAB® (MathWorks, Natick, MA, USA). Given the size of the samples, a dimensionality reduction stage was also implemented using PCA. Both approaches have been used in identification in the context of face recognition [
27,
28], and are established methods used in spectral classification in the context of mass spectrometry [
29,
30].
Let the training set of samples be \( \Gamma_{1}, \Gamma_{2}, \Gamma_{3}, \ldots, \Gamma_{M-1}, \Gamma_{M} \). The average sample is defined as \( \Psi = \frac{1}{M}\sum_{i = 1}^{M} \Gamma_{i} \). Each sample differs from the average sample by the vector \( \Phi_{i} = \Gamma_{i} - \Psi \). Given the mean-centered sample matrix \( A = \left[ \Phi_{1}, \Phi_{2}, \Phi_{3}, \ldots, \Phi_{M-1}, \Phi_{M} \right] \), the covariance matrix \( C = \frac{1}{M}\sum_{n = 1}^{M} \Phi_{n} \Phi_{n}^{T} = \frac{1}{M} AA^{T} \) was calculated. The eigenvectors of this covariance matrix correspond to a set of orthonormal vectors that form a basis to represent the data with a reduced dimensionality. A previously published approach [28] was used to calculate indirectly the first M eigenvectors of the matrix C, by estimating the eigenvectors of the M × M matrix \( L = A^{T}A \), reducing the memory and computational requirements of this procedure.
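The indirect calculation rests on a standard identity: if \( v \) is an eigenvector of \( A^{T}A \) with eigenvalue \( \mu \), then \( Av \) is an eigenvector of \( AA^{T} \) with the same eigenvalue. The sketch below illustrates this with NumPy rather than the MATLAB code used in the study; the function names and the row-per-sample layout are assumptions.

```python
import numpy as np


def pca_basis(samples, n_components):
    """Compute a PCA basis through the small M x M matrix L = A^T A.

    samples: array of shape (M, N), one spectrum per row (assumed layout).
    Returns the average sample psi and U of shape (N, n_components).
    n_components should not exceed M - 1, since mean-centering removes one rank.
    """
    gamma = np.asarray(samples, dtype=float)
    psi = gamma.mean(axis=0)                  # average sample
    A = (gamma - psi).T                       # N x M, columns are Phi_i
    L = A.T @ A                               # M x M instead of N x N
    eigvals, V = np.linalg.eigh(L)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    U = A @ V[:, order]                       # map back: u = A v
    U /= np.linalg.norm(U, axis=0)            # unit-length eigenvectors of A A^T
    return psi, U


def project(sample, psi, U):
    """Coefficients omega_k = u_k^T (Gamma - Psi) for a single sample."""
    return U.T @ (np.asarray(sample, dtype=float) - psi)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spectra = rng.normal(size=(6, 40))        # 6 synthetic samples, 40 points each
    psi, U = pca_basis(spectra, 3)
    print(project(spectra[0], psi, U))
```

Working with the M × M matrix is advantageous whenever the number of points per spectrum is much larger than the number of samples.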
PCA-based identification uses the projection of a sample onto the eigenvectors \( u_{k} \) to calculate a set of coefficients \( \omega_{k} = u_{k}^{T} \left( \Gamma - \Psi \right), \; k = 1, 2, 3, \ldots, M' < M \), which describe each sample as a vector \( \Omega^{T} = \left[ \omega_{1}, \omega_{2}, \ldots, \omega_{M'} \right] \). The average of the vectors describing the training samples of a given class was used to represent that class in the new basis. Then, to identify a test sample, the Euclidean distances between the vector Ω describing the test sample and the vectors describing each class were calculated, and the class at the minimum distance from the test sample was assigned to it. PCA provides basis vectors that correspond to the directions of maximal variance in the sample space. In other words, maximal variance serves as an unsupervised criterion for clustering, and the test samples are then compared to the classes built from mosquito species that were identified morphologically; if the distance between the test sample vector and the correct class (i.e., mosquito species) vector was the smallest one, this was considered a positive identification. On the other hand, LDA considers class information to provide a basis that best discriminates the classes [27]. The LDA can be applied over the data set expressed in terms of the coefficients obtained by the PCA. Thus, PCA reduces the dimensionality of the data, and the LDA provides supervised classification.
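The minimum-distance rule described above can be sketched in a few lines. The example assumes that each sample has already been reduced to its coefficient vector Ω; the function names and the toy labels are hypothetical.

```python
import math
from collections import defaultdict


def class_means(vectors, labels):
    """Average the coefficient vectors of the training samples of each class."""
    grouped = defaultdict(list)
    for vec, label in zip(vectors, labels):
        grouped[label].append(vec)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def nearest_class(omega, means):
    """Assign the class whose mean vector is closest in Euclidean distance."""
    return min(means, key=lambda label: math.dist(omega, means[label]))


if __name__ == "__main__":
    train = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0]]
    labels = ["species A", "species A", "species B"]
    means = class_means(train, labels)
    print(nearest_class([1.0, 1.0], means))
```

An identification counts as positive when the class returned by this rule matches the morphological identification of the test specimen.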
The LDA basis vectors \( W_{opt} = \left[ w_{1} \; w_{2} \; w_{3} \ldots w_{P} \right] \) are obtained as the matrix \( W \) that maximizes the ratio \( \frac{|W^{T} S_{B} W|}{|W^{T} S_{W} W|} \), where \( S_{B} \) and \( S_{W} \) are the between-class scatter matrix and the within-class scatter matrix, respectively. This new set of vectors maximizes the distance between class means and minimizes the within-class variation. For test sample identification, a Euclidean distance approach similar to the one explained for the PCA case was implemented. Thus, in this case the ratio of between- to within-class scatter is the supervised criterion for clustering, and the test samples are compared, in the LDA basis, to the classes built from mosquito species that were identified morphologically; if the distance between the test sample vector and the correct class (i.e., mosquito species) vector was the smallest one, this was considered a positive identification. The performance of the LDA approach was tested using Monte Carlo cross-validation over 500 iterations. For each iteration, the data were split randomly, for each species, into 80% of the samples for training and 20% for testing. For this implementation, the first 50 vectors or components from the PCA stage were used, which, after being projected for the LDA algorithm, also generated a 50-component data set. This number of components was chosen after a performance analysis using a Monte Carlo approach, as it provided the best identification rates. The total data set consists of 826 spectral samples from 12 species.
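The combination of LDA and Monte Carlo cross-validation can be illustrated as follows. This is a sketch under stated assumptions, written with NumPy rather than the authors' MATLAB code: the input is taken to be the matrix of PCA coefficients, the scatter-ratio criterion is solved through the eigenvectors of \( S_{W}^{-1} S_{B} \) (using a pseudo-inverse), and the number of retained directions defaults to the number of classes minus one, which is a choice of this sketch. All function names are hypothetical.

```python
import numpy as np


def fit_lda(X, y, n_components):
    """Return a projection W built from the leading eigenvectors of pinv(S_W) S_B."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_W = np.zeros((d, d))                    # within-class scatter
    S_B = np.zeros((d, d))                    # between-class scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        S_W += (Xc - mc).T @ (Xc - mc)
        diff = (mc - overall_mean)[:, None]
        S_B += len(Xc) * (diff @ diff.T)
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real


def predict(W, X_train, y_train, X_test):
    """Nearest class mean (Euclidean distance) in the LDA space."""
    Z_train, Z_test = X_train @ W, X_test @ W
    classes = np.unique(y_train)
    means = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(Z_test[:, None, :] - means[None, :, :], axis=2)
    return classes[dists.argmin(axis=1)]


def monte_carlo_cv(X, y, n_iter=500, train_frac=0.8, n_components=None, seed=0):
    """Mean accuracy over random per-class train/test splits."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    rng = np.random.default_rng(seed)
    classes = np.unique(y)
    if n_components is None:
        n_components = len(classes) - 1       # assumption of this sketch
    scores = []
    for _ in range(n_iter):
        train_idx, test_idx = [], []
        for c in classes:                     # split each species separately
            idx = rng.permutation(np.flatnonzero(y == c))
            cut = int(round(train_frac * len(idx)))
            train_idx.extend(idx[:cut])
            test_idx.extend(idx[cut:])
        W = fit_lda(X[train_idx], y[train_idx], n_components)
        pred = predict(W, X[train_idx], y[train_idx], X[test_idx])
        scores.append(np.mean(pred == y[test_idx]))
    return float(np.mean(scores))


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(loc=m, scale=0.1, size=(20, 4)) for m in (0.0, 3.0, 6.0)])
    y = np.repeat(["a", "b", "c"], 20)
    print(monte_carlo_cv(X, y, n_iter=5))
```

Splitting within each class keeps the 80/20 proportion for every species, including those represented by few specimens.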
Discussion
Addressing the limitations of previous studies with the MALDI
Proof of concept with the MALDI mass spectrometry to examine species boundaries among arthropod vectors of diseases has been well established before in ticks (Ixodidae—
Rhipicephalus) [
16,
18], fleas (Pulicidae—
Ctenocephalides) [
17], tsetse flies (
Glossina spp.) [
19], sandflies (Psychodidae—
Phlebotomus) [
21,
31], biting midges (Ceratopogonidae—
Culicoides) [
32] and mosquitoes (Culicidae) [
10,
20,
22‐
25]. However, many of the experiments conducted up to now with the MALDI involved laboratory-reared specimens and few species or geographically discrete specimens of the same species. Also, with some recent exceptions [
9,
10,
23‐
25], full arthropod bodies were largely used in their protocols, leaving no morphological vouchers for trial confirmation and replication. Moreover, some of these publications employed fairly distinct sample processing protocols, thus making it difficult to decide about their appropriateness and usefulness to study different arthropod groupings. Different methodologies to handle samples with the MALDI mass spectrometry might result in different outcomes, yet few published studies have evaluated the influence of these differences on the resulting protein spectra.
Here, a methodology was adjusted to use mosquitoes of the same sex (i.e., only females) that were processed for a specific body part (e.g., only legs) and with the best protein extraction protocol, chosen from comparisons made in initial experiments using lab-colonized mosquitoes (i.e., Protocol #1). The MALDI mass spectrometry technique could also be used effectively and rapidly to discriminate among field-collected female individuals of various Neotropical
Anopheles species using only one leg, while maintaining good signal robustness. The use of legs to generate protein spectra from ticks and mosquitoes with the MALDI has been successfully accomplished before [
9,
10,
16,
23‐
25], yet so far this approach has not been used to classify samples of Neotropical
Anopheles species, nor has it been applied to field-collected specimens that were stored in silica gel.
Considering that one of the objectives of this study was to find the smallest portion of the mosquito that contained enough identifiable information, in order to preserve the specimen voucher for other molecular eco-epidemiological assays, the results obtained with only one leg per specimen are very attractive: almost the entire insect body remains available to investigate phylogenetic relationships, pathogen infection rates, and host blood type. Nevertheless, the intensity of the spectra collected with MALDI may decrease when working with field-collected samples and such a limited amount of biological material for homogenization. Still, in this study 92% of the analysed matrix-sample spots yielded spectra with high enough intensity to be picked up by the automated script (i.e., 825 out of 897 spots from the three technical replicates per specimen), and only 3 of 12 tested species had a spectra collection rate below 90%. Since the groups with the lower spectra success rates included some of the more abundant species, such as A. albimanus (78%), A. punctimacula (84%) and A. neivai (80%), and failures were equally likely across different localities and sampling dates, the lower spectra collection rate could potentially be due to degradation of some samples under unfavorable storage conditions, failure to load samples successfully onto the metal plate of the MALDI, or contamination from the field. However, the procedure allows researchers to try again several times by using any of the remaining legs of the mosquito, thus offering a practical and realistic way around this problem. Future studies will have to test additional conservation methods and determine whether preserving samples in silica gel was the cause of the low success rates in obtaining the expected number of spectra overall and per species.
A way around working without a reference library of protein spectra
The conventional MALDI Biotyper approach for species identification uses a reference library of well-characterized, species-specific protein spectra from laboratory-reared specimens, plus computational software from the vendor, to compare unknown spectra to those in the reference library. The program calculates the degree of similarity between a sample spectrum and the reference library and gives a simplified score ranging from 0.0 to 3.0, in which any score equal to or above 2.7 represents a perfect match between a sample spectrum and a particular library spectrum, and 2.3 can be used as the minimum threshold for an accurate identification at the species level. This methodology has been very successful in clinical studies involving bacteria pathogenic to humans because they are easy to cultivate in the laboratory and their colony-forming units offer robust and repeatable signals [
33]. However, to build a reference library with fresh and well-curated
Anopheles species requires high-quality, extremely consistent spectra from mosquitoes collected in the field as immature stages and reared in the insectary, which is difficult to accomplish because of problems either in field-collecting larvae of some species or in rearing them in the laboratory [
34]. To date, only partial reference libraries, with protein spectra from a mixture of laboratory-reared and field-collected mosquitoes, have been built with mixed quality standards, forcing the use of alternative, lower threshold scores for species identification of 1.8 [
9,
10,
23‐
25] or as low as 1.3 in recent studies [
35]. In addition, none of these studies have included Neotropical
Anopheles species.
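The score thresholds mentioned above can be summarized as a small decision rule. The sketch below only encodes the cut-offs quoted in the text (2.7 for a perfect match, 2.3 as the species-level minimum); the function name and the returned labels are hypothetical and do not reproduce the vendor software's output.

```python
def interpret_score(score, species_threshold=2.3, perfect_threshold=2.7):
    """Translate a similarity score (0.0 to 3.0) into an identification label.

    Hypothetical helper; the labels are illustrative.
    """
    if not 0.0 <= score <= 3.0:
        raise ValueError("score must lie between 0.0 and 3.0")
    if score >= perfect_threshold:
        return "perfect match"
    if score >= species_threshold:
        return "species-level identification"
    return "no reliable identification"


if __name__ == "__main__":
    print(interpret_score(2.8))
    print(interpret_score(1.9))
    # Relaxed cut-off of 1.8, as used with partial mosquito libraries.
    print(interpret_score(1.9, species_threshold=1.8))
```

Passing a relaxed `species_threshold` such as 1.8 or 1.3 shows how the same spectrum can move from rejected to accepted when library quality forces a lower cut-off.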
The quality of the spectra from the field-collected mosquitoes analysed in this study was lower than expected, requiring the use of other statistical techniques for identification. Mass fingerprinting of field-collected specimens that do not exist in a reference library, or whose reference spectra cannot be generated, requires alternative approaches that can detect distinctive features in the spectra of unknown samples. To address this shortcoming, smoothed and baseline-corrected spectra were produced from field-collected samples of 11 species of mosquitoes in the genus Anopheles plus Chagasia bathana and compared against the mean spectra from the same field samples, which served as a self-curated reference library. Further, a combination of unsupervised (PCA) and supervised (LDA) mathematical algorithms was used to classify mass spectra of field-collected Anopheles with high consistency.
In general, PCA outcomes were less discriminant and robust than those of LDA; still, PCA discriminated among Anopheles species from different subgenera with almost 90% accuracy and consistency. LDA was able to classify all 12 species of mosquitoes together with validation and cross-validation scores above 93%, both between and within subgenera. This included samples from seven localities across the entire country of Panama, including vectors and non-vectors of Plasmodium. Evidently, the clustering algorithm was more accurate for mosquito species that were phylogenetically distinct from the rest (i.e., the subgenus Kerteszia and the genus Chagasia), with a 100% success rate in these cases, while the success rate decreased for more closely related species (i.e., A. malefactor, from the Arribalzagia Series). Still, the global success rate was 93.33%, which is reasonably precise. Therefore, due to its supervised nature, LDA was able to identify field-collected Anopheles species without the need for a reference library of species-specific protein spectra, and with higher resolution and discriminant power than PCA.
Authors’ contributions
JRL and RAG designed and developed the experiments. JRL collected and identified the mosquitoes. AA, NDC and JCR performed the tests with the MALDI. JRL, JSG, FM, AG and RG analysed the data and produced the graphs. JRL and RAG wrote the first draft of the paper and LM, LFL, MJM, JSG, and FM contributed comments to subsequent versions of it. All authors read and approved the final manuscript.