The Plasmodium genome database

Kissinger, Jessica C.; Brunk, Brian P.; Crabtree, Jonathan; Fraunholz, Martin J.; Gajria, Bindu; Milgram, Arthur J.; Pearson, David S.; Schug, Jonathan; Bahl, Amit; Diskin, Sharon J.; Ginsburg, Hagai; Grant, Gregory R.; Gupta, Dinesh; Labo, Philip; Li, Li; Mailman, Matthew D.; McWeeney, Shannon K.; Whetzel, Patricia; Stoeckert, Christian J.; Roos, David S.

doi:10.1038/419490a

Commentary
Published: 03 October 2002

Nature volume 419, pages 490–492 (2002)

Designing and mining a eukaryotic genomics resource.

As reported elsewhere in this issue (M. J. Gardner et al. Nature 419, 498–511; 2002), a reference genome sequence for the human malaria parasite Plasmodium falciparum is now complete. But how are researchers to access P. falciparum genome sequence data, integrate this resource with other relevant data sets, and exploit the resulting information for functional studies, including identification of novel drug targets and candidate vaccine antigens?

The Plasmodium genome database (PlasmoDB, see http://PlasmoDB.org) contains information from multiple sources, including DNA sequence data and curated annotations, automated gene model predictions, predicted proteins and protein motifs, cross-species comparisons, optical and genetic mapping data, information on population polymorphisms, expression data generated by a variety of complementary strategies, and proteomics data. Integrating this information at a single site provides 'one-stop shopping' for genomics-scale data sets related to malaria parasites.

The use of a relational database architecture enables users to ask complex questions. For example, immunologists trying to develop an antimalaria vaccine might wish to identify potential immunodominant surface antigens. Drug developers might wish to identify enzymes expressed in bloodstream parasites that differ significantly from their human counterparts. Researchers interested in antigenic variation and how the parasite adheres to cells (a cause of malaria pathogenesis) might wish to identify all gene families in the parasite genome; those interested in genome organization might be interested in the chromosomal location of these genes; evolutionary biologists might wish to examine all genes for which clear orthologues are known from a range of species; and so on.

Universal access

It has taken six years to complete the P. falciparum genome sequence. In the meantime, interim data were periodically released by the three sequencing centres involved in this project, to advance research on basic malaria biology, and drug and vaccine development. PlasmoDB was developed to make this information available to the research community, notwithstanding the challenges posed by unfinished sequence data. This web-accessible database provides access to the entire genome sequence of the 3D7 reference strain of P. falciparum, together with computationally predicted and manually curated genes and gene models, protein feature predictions and functional annotation.

PlasmoDB went live in June 2000 — more than two years before today's formal completion of the P. falciparum reference sequence. The website receives several thousand hits each day from more than 100 countries, numbers that are certain to rise significantly with the release of the complete genome sequence. The result can be measured in the scores, possibly hundreds, of publications that have resulted, and in new targets now being assessed for drug and vaccine development.

Malaria biologists are a more diverse and dispersed community than those who study fruitfly or yeast genomes. They encompass field scientists in Cameroon, epidemiologists in Papua New Guinea, pharmaceutical developers in India, molecular geneticists in Brazil, and so on. Because many malaria researchers lack reliable high-speed Internet access, a platform-independent CD-ROM (to be distributed with Nature in a few weeks' time) has been developed to provide universal free access to the complete genome sequence and annotations currently available for this malaria parasite. More than a series of 'flat-file' images, P. falciparum GenePlot is a true database, providing a graphical user interface for browsing, querying, downloading and manipulating the genome and annotations on a desktop computer without web access.

It has been a stimulating challenge to see how many commonly asked questions can be accommodated in the CD-ROM format. For example, while local implementation of BLAST searches requires substantial memory and computational speed (and GenBank is too large to include on a single CD), GenePlot can be asked to find and retrieve all predicted proteins with similarity to proteases, based on text indices derived from precomputed BLAST comparisons of the entire P. falciparum genome against all of GenBank.
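
The idea of answering similarity questions from a text index, rather than by running BLAST locally, can be illustrated with a minimal Python sketch. This is not GenePlot's implementation: the protein identifiers, the hit descriptions and the function names (build_text_index, search) are all hypothetical, and the sketch assumes only that each predicted protein already carries text descriptions of its precomputed best database hits.

```python
from collections import defaultdict
import re

# Hypothetical precomputed results: predicted protein ID -> descriptions of
# its best database hits. Every identifier and description here is invented.
PRECOMPUTED_HITS = {
    "pred_0001": ["cysteine protease falcipain", "cathepsin L precursor"],
    "pred_0002": ["dihydrofolate reductase-thymidylate synthase"],
    "pred_0003": ["aspartic protease plasmepsin"],
    "pred_0004": ["hypothetical protein"],
}

def build_text_index(hits):
    """Map each lowercase word in a hit description to the proteins citing it."""
    index = defaultdict(set)
    for protein_id, descriptions in hits.items():
        for description in descriptions:
            for word in re.findall(r"[a-z0-9]+", description.lower()):
                index[word].add(protein_id)
    return index

def search(index, keyword):
    """Return the sorted protein IDs whose hit descriptions contain the keyword."""
    return sorted(index.get(keyword.lower(), set()))

INDEX = build_text_index(PRECOMPUTED_HITS)
print(search(INDEX, "protease"))  # ['pred_0001', 'pred_0003']
```

Because the expensive sequence comparison is done once in advance, the lookup itself needs little memory or computation, which is the property that makes this approach suitable for a CD-ROM.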

The initial motivation behind the GenePlot CD was to make the genome accessible to malaria biologists with limited Internet connectivity, but this format has also proved enormously popular with well-connected users. Having the data literally 'in hand' provides scientists everywhere with a sense of ownership and involvement in the Plasmodium genome project, expediting the pace of research and discovery related to malaria parasites and the devastating diseases they cause.

Unfinished business

In most genomics projects, initial mapping studies (desirable even with the advent of whole-genome shotgun sequencing) are followed by a random sequencing phase, then by a phase focusing on closure of remaining gaps to produce a 'finished' sequence (which may still contain numerous gaps, depending on complexity and size of the genome, time, patience and funding). Annotation is conducted to various levels of depth. Database development makes the information accessible to the user community. Finally, functional studies (transcript profiling, proteomic studies, genome-scale knockouts, and so on) become possible once the complete, annotated sequence is available to end-users.

There are good reasons for this sequential strategy. Gap closure is expensive, and so makes little sense while random sequencing may still yield useful information. Manual annotation of assembled sequences is also laborious, and is best deferred until the genome sequence is complete. For large, complex eukaryotic genomes, years may pass between the initial sequencing and the availability of this information in practical form for researchers in the lab. Such delays cause considerable frustration, as individual genes could be identified long before assembly of a finished genome.

Problems associated with unfinished data, and the accompanying need for user education regarding the interpretation of these results, provided the first challenge for PlasmoDB. Specific information missing in incomplete data sets limits confidence that a particular gene is absent from the organism. Contaminating sequences from cloning vectors and host cells may be present. Redundancy in the data set attributable to incomplete or inaccurate assemblies poses a further problem, particularly for the A/T-rich P. falciparum genome. In PlasmoDB, possible redundancy or inaccurate assembly was identified by high-stringency comparison of each sequence with the entire genome, and by comparison of DNA sequences with optical and genetic maps. The importance of these tools for P. falciparum declines as the genome project approaches completion, but they remain valuable for new projects, such as the other Plasmodium species now being sequenced.

Unfinished sequence data also pose challenges for gene identification and analysis, as the constantly changing nature of this information makes time-consuming manual annotation impossible. Comparisons with GenBank, computational gene-finding algorithms and protein feature analyses are feasible (Box 1), but generate a bewildering range of predictions: which of four competing gene predictions is most likely to be correct? Which of 60 sequences exhibiting similarity to cathepsin is really a protease? Automated analysis can help to provide provisional assignments early, before manual curation of the finished sequence. Even after first-pass annotation, these analyses can help to suggest alternative possibilities whenever new experimental information suggests inaccuracies in the curated annotation.

Integrative 'omics'

Many disciplines accommodate large data sets (MRI imaging, weather forecasting, ecological and econometric modelling, and so on), but this is a relatively new problem for molecular and cell biologists. How to collect the deluge of data engulfing us from genomics, transcriptomics, proteomics, glycomics, pharmacogenomics, vaccinomics, and even more hideously named approaches? What kind of tools will be required to analyse — and to integrate — these massive 'omics'-scale data sets? How can we use all this information to treat malaria?

PlasmoDB is based on a relational database architecture (GUS; Box 1), built around biologically relevant relationships following the central dogma of biology: 'gene to messenger RNA to protein'. Parallel views for other organisms (including other Plasmodium species) allow phylogenetic comparisons. Because all this information is in a single database, queries can combine searches for particular genes of interest with RNA and protein expression analysis, studies on population genetic polymorphisms, and cross-species comparison. One can envisage the incorporation of other data types, such as publication records, clinical outcome data, genomic information from the mosquito vector Anopheles gambiae, protein structural information from high-throughput crystallography studies, and chemical compound libraries.
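
A toy relational layout following 'gene to messenger RNA to protein' might look like the sketch below, written in Python with the standard sqlite3 module. It is not the GUS schema: the table names, column names and rows are invented, and the function expressed_secreted_genes is a hypothetical example of a query that combines gene location, expression evidence and a protein feature in a single statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Three hypothetical tables linked gene -> transcript -> protein.
conn.executescript("""
CREATE TABLE gene (
    gene_id INTEGER PRIMARY KEY, name TEXT, chromosome INTEGER);
CREATE TABLE transcript (
    transcript_id INTEGER PRIMARY KEY,
    gene_id INTEGER REFERENCES gene(gene_id),
    est_count INTEGER);
CREATE TABLE protein (
    protein_id INTEGER PRIMARY KEY,
    transcript_id INTEGER REFERENCES transcript(transcript_id),
    has_signal_peptide INTEGER);
""")
# Invented example rows.
conn.executemany("INSERT INTO gene VALUES (?, ?, ?)",
                 [(1, "geneA", 4), (2, "geneB", 7), (3, "geneC", 4)])
conn.executemany("INSERT INTO transcript VALUES (?, ?, ?)",
                 [(10, 1, 5), (20, 2, 0), (30, 3, 2)])
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)",
                 [(100, 10, 1), (200, 20, 1), (300, 30, 0)])

def expressed_secreted_genes(connection):
    """Genes with EST evidence whose protein carries a predicted signal peptide."""
    return connection.execute("""
        SELECT g.name, g.chromosome
        FROM gene AS g
        JOIN transcript AS t ON t.gene_id = g.gene_id
        JOIN protein AS p ON p.transcript_id = t.transcript_id
        WHERE t.est_count > 0 AND p.has_signal_peptide = 1
        ORDER BY g.name
    """).fetchall()

print(expressed_secreted_genes(conn))  # [('geneA', 4)]
```

Adding another data type in such a layout means adding a table keyed to one of the existing ones, after which it can take part in the same joins.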

The power of database queries

PlasmoDB provides graphic and text-based views of all available Plasmodium genomic sequences, curated annotations, and tools for retrieval of these data. But the sheer wealth of information can make browsing difficult, so the database allows the user to define custom views. For all their visual appeal, however, static, precomputed views are inherently restricted, and so fail to answer many genomic-scale questions that arise in the laboratory.

The relational database underlying PlasmoDB permits queries that integrate diverse data types, as illustrated by questions relating to drug and vaccine development (Table 1). For example, a medicinal chemist might be interested in P. falciparum dihydrofolate reductase (DHFR), the target of pyrimethamine, a drug used in common antimalarial agents. The gene encoding this enzyme can be identified by EC number or GO function, text searches of curated annotation using the enzyme name as a key word, text searches against BLAST results, motif searches for protein sequence signatures, BLAST similarity to DHFR sequences from other species, or searches based on protein structural predictions. Degenerate searches are also possible, such as searching for all proteases. The results returned would undoubtedly contain false positives, but these can be weeded out by scientists familiar with protease characteristics. Candidate cytoskeletal proteins can be identified by similar strategies, or by searches based on protein structural predictions. Such searches can then be refined, for example by identifying sequences conserved in multiple malaria parasites, or those that are sufficiently distinct from human orthologues to provide a basis for selective inhibition.

Table 1 Querying Plasmodium genome data

Information on metabolic pathways and/or subcellular localization can also be used to inform database queries. For example, PlasmoDB enables the identification of proteins likely to be associated with the apicoplast (a distinctive organelle that has received considerable attention as a candidate drug target) on the basis of curated annotation, exploiting the structured gene ontology (GO) vocabulary. Alternatively, the origins of this organelle by horizontal transfer of an algal chloroplast can be exploited as the basis for a text search for genes exhibiting sequence similarity to plastid, chloroplast or plant genes. Phylogenetic comparison with plant species is not currently supported in PlasmoDB, but all nucleotide and predicted protein sequences can be downloaded by users for local analysis.

Combining gene and protein predictions with the results from RNA and/or protein expression analysis enables enzymes being considered for antimalarial drug development to be filtered, removing any proteins not expressed in blood-stage parasites. Integrating these data with functional studies, polymorphism data, publications, or small-molecule databases, would allow further refinement.

For immunologists, computationally accessible queries allow identification of particular genes of interest as vaccine antigens (see Table 1). Additional gene-family members can be recognized on the basis of sequence similarity. Probable surface antigens can be identified from the presence of signal sequences, transmembrane domains, acylation signals or glycophosphatidylinositol (GPI) anchor motifs. Additional queries of immunological relevance might include the presence of predicted immunodominant epitopes, expression in life-cycle stage(s) of interest, conservation in multiple P. falciparum isolates, and evidence of immune selection based on highly repetitive elements, low-complexity sequence or polymorphisms identified in population genetic studies.

PlasmoDB can be used to build complex queries using boolean operators. For example, searching PlasmoDB release 3.3 for all genes predicted to contain a secretory signal sequence yields 1,952 hits. Because this search used curated annotations plus the predictions from any one of several distinct gene-finding algorithms, the results are several-fold redundant, yielding about 800 distinct genes, or more than 15% of the parasite genome. More than twice as many proteins (5,003) are predicted to contain transmembrane domains, but the intersection of these results yields only 1,083 hits (about 400 distinct proteins) exhibiting both features. Next, the database can be searched for all messenger RNAs known from expressed sequence tag (EST) evidence, yielding 3,057 hits (searches based on microarray or proteomics evidence are also possible). The intersection between these secretory pathway and expression searches identifies a grand total of 190 candidates, probably corresponding to fewer than 100 distinct genes.
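
The logic of this worked example, intersecting hit lists and then collapsing redundant predictions to distinct genes, can be sketched with Python sets. The data are invented and tiny, and the names (finderX, finderY, distinct_loci) are hypothetical; the sketch assumes only that each hit pairs a prediction with the locus it covers.

```python
# Each hit pairs a prediction with the locus it covers. Several algorithms may
# predict the same locus, so raw hit counts overstate the number of genes.
# All identifiers below are invented for illustration.
SIGNAL_PEPTIDE = {
    ("curated_1", "locusA"), ("finderX_1", "locusA"),
    ("finderY_2", "locusB"), ("finderX_3", "locusC"),
}
TRANSMEMBRANE = {
    ("curated_1", "locusA"), ("finderX_1", "locusA"),
    ("finderX_3", "locusC"), ("finderY_4", "locusD"),
}
EST_EVIDENCE = {("curated_1", "locusA"), ("finderY_4", "locusD")}

def distinct_loci(hits):
    """Collapse redundant predictions to the sorted list of loci they cover."""
    return sorted({locus for _, locus in hits})

# Boolean AND corresponds to set intersection.
SECRETORY = SIGNAL_PEPTIDE & TRANSMEMBRANE
CANDIDATES = SECRETORY & EST_EVIDENCE

print(len(SECRETORY), distinct_loci(SECRETORY))    # 3 ['locusA', 'locusC']
print(len(CANDIDATES), distinct_loci(CANDIDATES))  # 1 ['locusA']
```

Note that intersecting prediction-level hits, as here, can give a different answer from intersecting locus-level lists: a locus drops out if the feature and the expression evidence are attached to different predictions of the same gene.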

Two key points emerge from these queries. First, the power of a database devoted to mining genomics-scale data sets comes from its ability to form relational (integrated) queries, allowing researchers to frame their own questions. No encyclopaedic version of precomputed analyses and 'canned' queries will ever provide all possible answers in advance. For example, neither computational analysis nor manual curation would have been likely to identify enzymes associated with the apicoplast before this organelle was discovered and its targeting signals mapped.

Second, the goal of these queries is not to get the 'right' answer (a provably correct list of valid drug targets or vaccine antigens), but to reduce the options, filtering the overwhelming number of sequences in the genome down to a few genes amenable to experimental analysis — in short, to let computers do what computers do well, and to let people do what people do well. Integrating the results of such studies into the database completes the loop, with computational and experimental analysis in the lab building on each other to accelerate the pace of biological research.

The CD-ROM containing P. falciparum GenePlot and other malaria-related resources, including Nature's malaria Insight of 7 February 2002 and the papers reported elsewhere in this issue, will be provided to all Nature subscribers in a few weeks' time. It can also be obtained from helpcd@plasmodb.org or the Malaria Research and Reference Reagent Resource Center (MR4) by an e-mail request to malaria@atcc.org, with “Nature malaria CD-ROM” in the subject line. A full postal address must be included in the body of the message.

Box 1: The architecture of PlasmoDB

PlasmoDB is not itself a database, but a web interface that uses an underlying relational database (GUS, for genomics unified schema), which stores and integrates nucleotide sequences, annotation, information on gene expression and regulation, controlled vocabularies/ontologies, and evidence for these annotations. GUS is organism-independent and also contains the human and mouse genomes (http://www.allgenes.org). The schema, associated code and project-independent data are at http://www.gusdb.org.

Primary P. falciparum sequence data are subjected to automated analyses (sequence analysis layer), including the identification of motifs and simple repeats; comparison against the entire genome to identify gene families, repetitive elements and redundancy; searching for intron/exon structure, using several algorithms trained on experimentally validated P. falciparum sequences; conceptual gene translation and identification of potential protein motifs; and comparisons with the non-redundant GenBank/EMBL database (results retained in a text-queryable index). Genomic contig sequences are aligned to optical restriction maps and microsatellite linkage groups using hidden Markov models for fragment length and ePCR.
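
As a rough illustration of the ePCR step only, the Python sketch below locates a primer pair on a contig and reports predicted products within a size window. It is a simplification under stated assumptions: matching is exact and restricted to one strand (production ePCR tools typically tolerate mismatches), the sequences are invented, and the function names are hypothetical. The hidden Markov model alignment to optical restriction maps is not covered.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def electronic_pcr(contig, forward, reverse, min_size, max_size):
    """Return (start, product_length) for each exact-match product on the given strand."""
    products = []
    # The reverse primer anneals to the opposite strand, so on this strand
    # its binding site appears as its reverse complement.
    reverse_site = reverse_complement(reverse)
    start = contig.find(forward)
    while start != -1:
        end = contig.find(reverse_site, start + len(forward))
        while end != -1:
            length = end + len(reverse_site) - start
            if min_size <= length <= max_size:
                products.append((start, length))
            end = contig.find(reverse_site, end + 1)
        start = contig.find(forward, start + 1)
    return products

# Invented contig: 4 nt flank, forward site, 10 nt spacer, reverse site, flank.
CONTIG = "TTTT" + "ACGTAC" + "AAAAAAAAAA" + "GGCATC" + "TTTT"
print(electronic_pcr(CONTIG, "ACGTAC", "GATGCC", 10, 50))  # [(4, 22)]
```

A marker whose primers yield a product of the expected size places the contig on the corresponding linkage group; a missing or wrongly sized product flags a possible gap or misassembly.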

The GUS schema employs views that are used in an object layer for parent–child relationships. To facilitate data loading, Perl was used to create a 'thin' object layer in which each relational table is treated as an object. GUS is partitioned into distinct name spaces. Core contains workflow tables, tracking how each row in the database is populated (data provenance). Sres (shared resources) contains controlled vocabularies and ontologies, such as taxonomy, anatomy and disease tables. TESS captures descriptions (grammar representations) for genetic regulatory regions (not currently implemented for PlasmoDB). DoTS houses sequence and sequence annotation. Any sequence span can have multiple features mapped to it, and gene predictions can be associated with multiple transcripts and proteins. Each predicted or experimentally determined transcript may itself have multiple features and similarities, as can each protein entry. RAD handles data from high-throughput technologies for studying gene expression. RAD currently accommodates expression data from SAGE (serial analysis of gene expression), cDNA and oligonucleotide glass slide microarrays, and Affymetrix chips, and is extensible to accommodate information from other platforms. Sample information, together with other experimental descriptions, can be entered directly into the database via web-based forms.
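
The notion of a 'thin' object layer, in which each relational table is treated as an object, can be sketched as follows. The original layer was written in Perl; this Python version using sqlite3 is an illustrative analogue, not GUS code, and the class name TableObject, its methods, and the sequence and feature tables are all hypothetical.

```python
import sqlite3

class TableObject:
    """Thin wrapper that treats one relational table as an object."""

    def __init__(self, connection, name, column_definitions):
        # Table and column names come from trusted code, never from user input.
        self.connection = connection
        self.name = name
        connection.execute(
            f"CREATE TABLE {name} ({', '.join(column_definitions)})")

    def insert(self, **values):
        """Insert one row and return its primary key."""
        names = ", ".join(values)
        marks = ", ".join("?" for _ in values)
        cursor = self.connection.execute(
            f"INSERT INTO {self.name} ({names}) VALUES ({marks})",
            tuple(values.values()))
        return cursor.lastrowid

    def children_of(self, parent_column, parent_id):
        """Return rows of this table that point at the given parent row."""
        cursor = self.connection.execute(
            f"SELECT * FROM {self.name} WHERE {parent_column} = ? ORDER BY id",
            (parent_id,))
        return cursor.fetchall()

conn = sqlite3.connect(":memory:")
sequence = TableObject(conn, "sequence", ["id INTEGER PRIMARY KEY", "name TEXT"])
feature = TableObject(
    conn, "feature",
    ["id INTEGER PRIMARY KEY", "sequence_id INTEGER", "kind TEXT"])

# One sequence span can carry several features (parent-child relationship).
SEQ_ID = sequence.insert(name="contig_1")
feature.insert(sequence_id=SEQ_ID, kind="gene_prediction")
feature.insert(sequence_id=SEQ_ID, kind="simple_repeat")
FEATURES = feature.children_of("sequence_id", SEQ_ID)
print(FEATURES)  # [(1, 1, 'gene_prediction'), (2, 1, 'simple_repeat')]
```

Keeping the layer this thin means data-loading scripts can work with objects while the relational tables remain the single source of truth.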

The RAD schema is compliant with MIAME guidelines (http://www.mged.org/). A microarray gene expression (MAGE) object model and XML-based language have been developed for data exchange, and importers and exporters are being built for RAD to MAGE-ML.

Author information

Authors and Affiliations

The Plasmodium Genome Database Collaborative, The Departments of Biology and Genetics, Center for Bioinformatics and Genomics Institute, University of Pennsylvania, Philadelphia, 19104-6018, Pennsylvania, USA
Jessica C. Kissinger, Brian P. Brunk, Jonathan Crabtree, Martin J. Fraunholz, Bindu Gajria, Arthur J. Milgram, David S. Pearson, Jonathan Schug, Amit Bahl, Sharon J. Diskin, Hagai Ginsburg, Gregory R. Grant, Dinesh Gupta, Philip Labo, Li Li, Matthew D. Mailman, Shannon K. McWeeney, Patricia Whetzel, Christian J. Stoeckert Jr & David S. Roos

Corresponding author

Correspondence to Jessica C. Kissinger.

About this article

Cite this article

Kissinger, J., Brunk, B., Crabtree, J. et al. The Plasmodium genome database. Nature 419, 490–492 (2002). https://doi.org/10.1038/419490a

Issue Date: 03 October 2002
DOI: https://doi.org/10.1038/419490a
