Background
Malaria is the most devastating parasitic disease in the world. The disease affects more than 216 million people and kills nearly 655,000 people every year [
1]. More than forty percent of the world’s population lives at risk of infection [
2]. Parasite resistance to available chemotherapy drugs and vector resistance to insecticides are increasing and spreading around the world, hampering disease control [
3,
4]. The persistent huge socioeconomic impact of the disease and reports of resurgence in African countries show that, despite the control efforts, malaria is still a global health challenge [
5]. Hence, new strategies to control malaria are essential to combat, eliminate or even eradicate the disease [
6].
In recent years, the sequencing of genomes, transcriptomes and proteomes and their related high-throughput analyses have become major strategies for unraveling the detailed aspects of
Plasmodium biology and the interactions between the parasite and its vertebrate and invertebrate hosts [
7,
8]. Genome sequence data is available for at least eight
Plasmodium species, providing opportunities for many groups to join the renaissance in malaria research and translate this massive amount of data into commercially available new drugs and anti-malarial vaccines, which are still promises of the genomic era [
3,
9].
The first step in genome-wide identification of new drug targets or vaccine candidates is mainly based on the identification of molecules at the interface between parasite and hosts, or of members of biochemical pathways unique to the parasite, through
in silico strategies, and these analyses are highly dependent on the accuracy of genome annotations. Massive-scale sequencing has considerably improved the annotation strategies, particularly the process of identifying coding sequences (CDS) [
10]. However, gene annotation is still far from trivial, especially in eukaryotic genomes, and among the most difficult tasks are the identification of the initial methionine and of intron/exon boundaries [
11].
Homology reflects the evolutionary history of genes and, after the recent expansion of genomics, has reemerged as one of the key concepts of evolutionary biology. Orthologs are genes derived from a single ancestral gene in the last common ancestor of the species being compared, whereas paralogs are genes related via duplication events [
12]. Orthology is the basis of any comparative genetic analysis, because orthologs tend to retain equivalent molecular and biological functions [
13], justifying its use in interspecific comparative analyses to assist in gene annotation, by exploring evolutionary gene histories, conservation, variability of molecular sequences and functional characterizations [
14].
Protein trafficking is essential for all organisms and this process is primarily governed by intrinsic signals found in protein sequences. The best-known and studied transport motif is the signal peptide, usually located in the N-terminal end of proteins that are translocated across the plasma membrane (prokaryotes) or the endoplasmic reticulum membrane (eukaryotes) [
15‐
17]. Signal peptides play an indirect role in the biological function of proteins, in the sense that they help determine the subcellular environment in which a given protein will be available for interactions [
18]. Since function is usually conserved among orthologous proteins, it was hypothesized that subcellular localization and, consequently, signal peptide status would be conserved as well. Divergences among orthologs could be explained by (
i) misannotation of protein sequences; (
ii) limitations of the methodologies used to predict signal peptides or to assign orthology relationships; or (
iii) true biological divergence resulting from the unique evolutionary history of the gene in each species.
In order to study the source of divergences among Plasmodium proteins, an innovative yet computationally simple strategy was devised, in which signal peptide predictions and orthology were combined. This strategy helped to determine the prevalence of N-terminal sequence misannotations among Plasmodium proteins and, more importantly, it guided the process of revision of misannotated proteins, therefore improving the available information on protein sorting for the genus.
Discussion
The approach proposed in this work shifts the perception of signal peptide data from an exclusive property of individual proteins to a perspective in which it also becomes a descriptive characteristic of orthologous groups of proteins, with groups being classified into three distinct categories: Positive, Negative or Mixed. It is important to note that paralogs, in general, evolve diverging functions more rapidly and more often than orthologs [
12,
30]. Therefore, the expected conservation of signal peptide predictions among orthologs does not necessarily hold true for paralogs, as demonstrated for
P. vivax VIR protein family [
31]. Even though OrthoMCL will only cluster recent paralogs [
20], groups containing multiple proteins per species were excluded from the reannotation analyses. This was done because, generally, paralogs are not as conserved as orthologs in function and, consequently, in signal peptide state, according to the ortholog conjecture [
32]. The ortholog conjecture is the paradigm behind the widespread use of orthology in comparative biology; however, it has always been a rather theoretical proposition, and only recently was it put to the test. Some studies have contested its validity, especially the direct link made between function and sequence similarity. In one particular study, paralogs were shown to be better predictors of function than orthologs [
33]. On the other hand, the ortholog conjecture has been reaffirmed by other studies that showed significantly more conservation of protein structure and expression profiles among orthologs than paralogs [
32]. These recent studies clearly signal that the debate is still open. The conservation of signal peptides has not been addressed directly until now.
The strategy was applied to the predicted protein sets of five Plasmodium species, and a substantial number of proteins showed signal peptide predictions that diverged from those of their orthologs. The rate of Mixed groups observed was higher than expected, considering the presumed rarity of divergence and the close evolutionary proximity of the species studied (same genus). Therefore, a few probable explanations were considered: (i) misannotated proteins, particularly at their N-terminal end; (ii) errors or shortcomings in the prediction programs; and (iii) biological diversity due to divergence in the course of evolution, which would constitute the real Mixed groups.
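The group classification described above can be illustrated with a minimal sketch. It assumes that each protein has already been assigned a binary signal peptide prediction and an orthologous group; the function name, the group identifiers and the example predictions are hypothetical and are not taken from the study's pipeline.

```python
from collections import Counter


def classify_group(members):
    """Classify one orthologous group by its signal peptide predictions.

    members: list of (species, has_signal_peptide) tuples, one per protein.
    Returns "Positive", "Negative" or "Mixed", or None when any species
    contributes more than one protein (group excluded from the analysis).
    """
    if not members:
        raise ValueError("an orthologous group needs at least one protein")
    per_species = Counter(species for species, _ in members)
    if any(count > 1 for count in per_species.values()):
        return None  # multiple proteins per species: excluded
    calls = {bool(has_sp) for _, has_sp in members}
    if calls == {True}:
        return "Positive"
    if calls == {False}:
        return "Negative"
    return "Mixed"


# Hypothetical groups; identifiers and predictions are invented for illustration.
example_groups = {
    "group_1": [("P. falciparum", True), ("P. vivax", True), ("P. yoelii", True)],
    "group_2": [("P. falciparum", False), ("P. vivax", False)],
    "group_3": [("P. falciparum", True), ("P. vivax", False), ("P. yoelii", True)],
    "group_4": [("P. vivax", True), ("P. vivax", False), ("P. yoelii", True)],
}
labels = {name: classify_group(members) for name, members in example_groups.items()}
```

In this sketch, group_3 would be flagged as Mixed and become a candidate for N-terminal inspection, whereas group_4 would be set aside because it holds two proteins from the same species.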
Misannotated sequences were the most likely source of diverging signal peptide predictions. It is known that defining the initial methionine is one of the most challenging tasks for gene annotation algorithms, particularly for eukaryotes, which means that annotation of the N-terminal end of proteins, exactly where most signal peptides are found, is intrinsically less accurate [
34]. The majority of
Mixed groups had at least one protein that needed N-terminal sequence reannotation. Compared with
Positive and
Negative groups, the rate of misannotated proteins in
Mixed groups is much higher, signaling that the combined strategy was efficient at enriching for misannotated sequences, a desirable trait in a quality control mechanism for sequence accuracy at genomic scale.
Most protein reannotations resulted in altered signal peptide predictions, which in turn led to the reclassification of orthologous groups. The new distribution of Positive, Mixed and Negative groups demonstrates that orthologs diverging drastically in their putative subcellular targeting are far less common than the original annotations suggested, and that this erroneous interpretation was mostly due to sequence misannotation. The observed reduction of Mixed groups from 541 to 289 due to reannotation is, indeed, a conservative estimate, as additional reannotations are still a possibility for the 17 Inconclusive groups and the 82 Partially reannotated groups, in which there are proteins that could not be reannotated at this moment. Therefore, the eventual correction of these groups could result in an even lower rate of Mixed groups.
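As a worked check of the figures quoted above, the net reduction in Mixed groups can be expressed as a fraction of the original count. Only the numbers stated in the text are used; the symbol \Delta is introduced here for illustration.

```latex
\[
\Delta = 541 - 289 = 252, \qquad
\frac{\Delta}{541} = \frac{252}{541} \approx 0.466
\]
```

That is, roughly 47% of the groups originally classified as Mixed no longer carry that label after reannotation. The 17 + 82 = 99 Inconclusive and Partially reannotated groups are not subtracted here, because the text does not state how many of them are counted among the remaining 289.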
The main reason preventing the reannotation of proteins from groups
Partially reannotated was the truncation of the upstream flanking region. This is directly related to the assembly states of genomes and explains why
P. yoelii genes were most affected. According to PlasmoDB (v7.1), among the studied species,
P. yoelii has the genome with the highest count of unassigned contigs (5687), followed by
P. vivax (2770). Another reflection of the assembly state of
P. yoelii genome is made clear in Figure
2A, in which proteins from this species seem to be missing from several orthologous groups. Improvements in the genome assembly would likely result in the identification of these missing orthologs by gene prediction algorithms.
Sequence misannotations are more likely to generate negatively predicted proteins. Since signal peptides are defined by typical structural constraints [
35], the probability that any randomly chosen amino acid stretch (≥ 40 amino acids), coded by a genomic sequence and having a methionine in the first position, will hold a signal peptide is lower than the probability that it will not (data not shown). Therefore, proteins with a wrongly assigned initial methionine tend to show negative signal peptide predictions. Thus, while most proteins without a signal peptide will preserve their signal peptide predictions even if misannotated, most proteins with a signal peptide will have their predictions inverted when misannotated. This uneven effect explains why the rate of misannotations is higher in
Negative than in
Positive groups and why most suggested reannotations have resulted in proteins turning from negative to positive predictions. The underlying message is that, as a rule, this particular reannotation strategy tends to increase the set of proteins predicted to have signal peptides, as demonstrated for four out of the five species studied, and this biased enrichment of positive proteins may be beneficial in the search for new vaccine targets.
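The asymmetry argued above can be written as a simple probabilistic sketch. It rests on a simplifying assumption that is not part of the original analysis: a misannotated N-terminus is treated as an arbitrary methionine-initiated stretch whose prediction is independent of the true status of the protein. Let p be the probability that such a stretch is predicted to hold a signal peptide; the text states only that p is below one half.

```latex
\[
P(\text{prediction inverted} \mid \text{true positive}) = 1 - p, \qquad
P(\text{prediction inverted} \mid \text{true negative}) = p,
\]
\[
\frac{1 - p}{p} > 1 \iff p < \tfrac{1}{2}.
\]
```

For an illustrative value of p = 0.1 (assumed, not measured in the study), a misannotated protein that truly has a signal peptide would lose its positive prediction with probability 0.9, whereas a misannotated protein without one would gain a positive prediction with probability 0.1. Under this assumption, correcting misannotations is expected to convert predictions from negative to positive far more often than the reverse.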
In an effort to understand the persistent classification of some groups as Mixed, signal peptide prediction itself was also investigated as a source of divergence among orthologs. When combining orthology and signal peptide information, the default settings applied by PlasmoDB for signal peptide prediction were used; however, there were concerns about how well adjusted these settings were and whether there was room for improvement. With the intention of avoiding or, at least, reducing the number of false Mixed groups created by faulty predictions, it was reasoned that optimal prediction conditions would be found when predictions among orthologs reached their highest level of agreement, minimizing the number of Mixed groups. Optimization was carried out only once, after reannotations were incorporated into the database. Ideally, parameter optimization and sequence reannotation should complement each other in an iterative process, with new reannotations being incorporated at each round and optimal conditions being recalculated afterwards. Therefore, the new thresholds suggested here should be considered with caution and should not be taken as definitive values, because many factors could still cause further alterations (new reannotations, incorporation of new genes, changes in orthology).
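The optimization criterion described above, choosing the prediction setting under which orthologs agree most often, can be sketched as a simple grid search. This is an illustration under stated assumptions rather than the procedure actually used: each protein is assumed to carry a single numeric prediction score, a single cutoff is tuned, and all scores, group identifiers and candidate cutoffs below are hypothetical.

```python
def count_mixed(groups, threshold):
    """Count groups whose members disagree at the given score cutoff.

    groups: dict mapping a group id to a list of scores, one per ortholog.
    A protein is called positive when its score is >= threshold.
    """
    mixed = 0
    for scores in groups.values():
        calls = {score >= threshold for score in scores}
        if len(calls) == 2:  # both positive and negative calls present
            mixed += 1
    return mixed


def best_threshold(groups, candidates):
    """Return the candidate cutoff yielding the fewest Mixed groups.

    Ties are resolved in favor of the earliest candidate in the sequence.
    """
    return min(candidates, key=lambda t: count_mixed(groups, t))


# Hypothetical scores, invented for illustration only.
example_scores = {
    "g1": [0.55, 0.62, 0.48],
    "g2": [0.10, 0.12, 0.08],
    "g3": [0.90, 0.47, 0.88],
    "g4": [0.42, 0.20, 0.15],
}
candidate_cutoffs = [0.40, 0.45, 0.50]
chosen = best_threshold(example_scores, candidate_cutoffs)
```

One caveat of this criterion is that extreme cutoffs trivially remove all Mixed groups by calling every protein positive or every protein negative, so the candidate range should be restricted to plausible values around the default setting.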
Also, optimization by itself does not correct intrinsic software limitations such as a biased training dataset. Although SignalP is a robust application and has been widely employed, its eukaryotic training dataset is dominated by mammalian sequences [
36] and it is possible that signal peptides from
Plasmodium proteins are somewhat different from those of mammals. This difference alone could be responsible for overestimation of the divergence. Nonetheless, the results offer a fresh view on how to improve signal peptide predictions within clusters of species without having to implement major changes in existing prediction software, and they could also contribute to the development of predictors, as
Mixed groups may help identify which sequences are beyond current detection limits and should, therefore, be incorporated in future training data sets.
Pre-calculated orthology clustering was chosen over an independent assessment because this information is readily available for download from PlasmoDB, reflecting the resources available to the malaria community. For the same reason, PlasmoDB’s SignalP prediction settings were used instead of the settings of the standalone version of SignalP. Also, using pre-calculated clustering made the strategy less computationally demanding. Independent clustering could have an impact on reannotations, as groups could have been added or lost, but the major results and the overall conclusions would not change. Last, considering the evolutionary proximity of the studied species and the high conservation observed among orthologs in most groups, independent clustering would not be expected to vary much from that obtained from OrthoMCL.
Biological features of
Plasmodium could also justify difficulties in signal peptide prediction. Some
Plasmodium secretory proteins use ‘unconventional protein secretion’, which collectively describes several kinds of unusual trafficking pathways that lead to the exposure of proteins on cell surfaces or to their release into the extracellular space [
37,
38]. This includes Golgi-independent trafficking of integral membrane proteins [
39] and other variations of transport modes within the classical secretory pathway [
37,
38]. In these cases, typical signal peptides are not present, and many known secreted proteins of
Plasmodium are included in this category, for example, RESA, GBP-130, Pf41-2, PfHPRT, FIRA, among others [
40]. However, even for these proteins the expected conservation of signal peptide prediction state is applicable. If a given protein is trafficked via an alternative route and features a negative signal peptide prediction, the same result is to be expected from its orthologs, as they would also be subjected to the same biological processing.
Plasmodium has a very complex life cycle with multiple invasion steps mediated by highly specialized apical organelles (rhoptries, micronemes and dense granules), and targeting to these organelles is signal peptide dependent [
41]. Once invaded, red blood cells (RBCs) are remodeled by
Plasmodium in a process that involves the export of several parasite proteins to the cytoplasm and membrane surface of RBCs [
42]. Here again, signal peptides are required for entry into the ER and subsequent targeting to the parasitophorous vacuole (PV) lumen, the default secretory pathway for
P. falciparum proteins [
43]. Biological diversity within the
Plasmodium genus is also a possible explanation for
Mixed orthologous groups, and the implications of divergent orthologs are rather interesting, as they are likely to be involved in processes that are unique to a few or even one organism. In
Plasmodium, these genes could mediate or interfere with any of the several singular phenomena that set species apart, such as sequestration in
P. falciparum, host cell invasion preferences of merozoites, variability in the maturation or morphology of gametocytes or the formation of latent stages in
P. vivax [
44]. Identification of such instances, where interspecific diversity could be occurring, is of utmost relevance to malariology. However, unequivocal demonstration of biological divergence, in terms of protein localization, demands experimental procedures (e.g., fluorescent protein tagging, immunohistochemistry with specific antibodies), which are beyond the scope of this work. Nonetheless, the search for true biological diversity was narrowed to a subset of 141 groups that have kept their Mixed classification despite efforts of reannotation and optimization. Interestingly, signal peptide prediction patterns that concur with the phylogeny of
Plasmodium species were significantly overrepresented in these groups, which argues in favor of biological novelties, as the observed divergences could then be attributed to the particular evolutionary history of each species. The proteins from these groups in particular warrant further studies to confirm or reject their link to biological phenomena restricted to subsets of
Plasmodium species.
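One possible way to test whether a prediction pattern concurs with a phylogeny is sketched below. The definition used is an assumption made for illustration: a Mixed group is considered concordant when either its positively or its negatively predicted species form exactly one clade of a reference tree, so that a single change along the tree could explain the pattern. The species labels A to E and the clade list are placeholders and do not represent the actual Plasmodium phylogeny or the test applied in this work.

```python
def is_phylogeny_concordant(positive_species, all_species, clades):
    """Check whether a Mixed pattern matches one clade of a reference tree.

    positive_species: species with a positive signal peptide prediction.
    all_species: every species represented in the group.
    clades: collection of frozensets, one per clade considered informative.
    """
    positive = frozenset(positive_species)
    negative = frozenset(all_species) - positive
    if not positive or not negative:
        return False  # uniform group, not Mixed
    return positive in clades or negative in clades


# Placeholder tree ((A,B),((C,D),E)) expressed as its internal clades.
# Single-species clades are left out here; including them is a design choice.
species = {"A", "B", "C", "D", "E"}
reference_clades = {
    frozenset({"A", "B"}),
    frozenset({"C", "D"}),
    frozenset({"C", "D", "E"}),
}
concordant = is_phylogeny_concordant({"A", "B"}, species, reference_clades)
discordant = is_phylogeny_concordant({"A", "C"}, species, reference_clades)
```

Counting concordant patterns among the remaining Mixed groups and comparing that count with the number expected by chance would then be a separate statistical step, which is not sketched here.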
The reannotations being proposed redefine the sets of proteins that are targeted to the ER of Plasmodium organisms and are highly relevant, since protein trafficking is crucial for the successful development of these organisms within their hosts. Therefore, direct as well as indirect experimental evidence was important to support the reannotations. Although validation of new gene models through RT-PCR does not allow proper identification of the initial methionine, it clearly demonstrated that the new gene models are a good fit to the mRNAs being expressed by parasites, whereas the original gene models were not. Only some reannotated proteins were amenable to RT-PCR validation, as a difference in the number of exons or a modification of exon/intron boundaries between original and new gene models is required. Apart from this prerequisite, targets for validation were chosen so that cases of both inversion and maintenance of signal peptide prediction were covered.
Available evidence of protein localization and its correlation with signal peptide predictions for the new protein sequences was also analysed. Although only eight proteins have been experimentally validated, most of their localizations are in accordance with their newfound signal peptide predictions. In fact, one of them [PlasmoDB:PVX_090075], a protein localized in the rhoptries, has been characterized as a promising vaccine candidate capable of eliciting a humoral immune response and the proliferation of lymphocytes from human patients [
45]. The only conflicting protein is a male-specific protein [PlasmoDB:PFB0400w] said to be cytoplasmic according to immunofluorescence assays; however, its patchy and diffuse pattern, coupled with a secretory signal sequence also identified in that manuscript, suggested that the protein may be located in cytoplasmic vesicles instead [
46]. Even when experimental evidence from orthologs was considered, the signal peptide prediction of a given protein and the localization of its orthologs were in close agreement. Out of 40 proteins with positive predictions, only one [PlasmoDB:PVX_117660] shows a signal peptide prediction incompatible with its ortholog’s localization. However, this protein showed a positive prediction even before reannotation, and it is the only protein in its orthologous group with a signal peptide; thus, it remains to be experimentally demonstrated whether this
P. vivax protein is indeed different from its orthologs. Among the negatively predicted proteins, concordance with orthologous localizations was lower; however, most signal peptide predictions from the orthologs themselves conflicted with their localizations. A possible explanation for the contradiction between negative predictions and subcellular localization might be alternative sorting routes independent of signal peptides, as discussed before.
Another challenge for signal peptide prediction in
Plasmodium species is the presence of the apicoplast, an organelle unique to Apicomplexa that resulted from secondary endosymbiosis. This organelle is an active site of transcription, translation and DNA replication [
47,
48]. Pharmacological and genetic perturbation of the apicoplast led to parasite death [
49,
50] and it was recently described that the essential function of the apicoplast during blood-stage growth is the biosynthesis of an isoprenoid precursor. Its essentiality for parasite survival and the absence of a metabolic counterpart in the human host make apicoplast proteins promising targets for anti-malarial drug development [
51]. As only a few proteins (~50, mostly housekeeping genes) are encoded in the organellar genome [
52], most apicoplast-housed proteins (~500 proteins), encoded in the nuclear genome, must be transported to the apicoplast via a mechanism mediated by a bipartite sorting element formed by a signal peptide followed by a transit peptide [
53]. Several of the reannotated proteins that became positively predicted have orthologs that are apicoplast-targeted, demonstrating how these reannotation efforts may assist in the quest for new anti-malarial drugs.
The search for new intervention targets for disease control is increasingly dependent on computational approaches that query and filter vast amounts of biological data, which makes annotation accuracy a priority since imprecise inputs will return low quality results. Signal peptides, for example, are extensively employed as a filter in reverse vaccinology strategies [
54], as targets for humoral response are usually secreted or surface-attached proteins, and incorrect protein N-terminal sequences would certainly prevent correct identification of putative targets. Most of the major
Plasmodium vaccine candidates (e.g., AMA-1, Pfs230, CS, PvDBP) [
55] are proteins that have predicted signal peptides, demonstrating how important this feature can be in the discovery of new vaccine targets. Also, information on signal peptides can be incorporated into the process of selecting drug targets when it is known or expected that the metabolic process targeted for intervention takes place in membrane-bound organelles or cellular compartments. Once more,
Plasmodium stands as a good example, as it has already been demonstrated that the food vacuole [
56] and the apicoplast [
49] are susceptible to anti-malarial compounds, and protein targeting to both these organelles is signal peptide dependent [
57]. Apicoplast targeting was one of the filtering criteria for identifying attractive drug targets in
Plasmodium falciparum in a study that used a comprehensive
in silico approach [
58].