Abstract
Functional annotation of novel sequence data is a primary requirement for the utilization of functional genomics approaches in plant research. In this paper, we describe the Blast2GO suite as a comprehensive bioinformatics tool for functional annotation of sequences and data mining on the resulting annotations, primarily based on the gene ontology (GO) vocabulary. Blast2GO optimizes function transfer from homologous sequences through an elaborate algorithm that considers similarity, the extension of the homology, the database of choice, the GO hierarchy, and the quality of the original annotations. The tool includes numerous functions for the visualization, management, and statistical analysis of annotation results, including gene set enrichment analysis. The application supports InterPro, enzyme codes, KEGG pathways, GO direct acyclic graphs (DAGs), and GOSlim. Blast2GO is a suitable tool for plant genomics research because of its versatility, easy installation, and friendly use.
1. Introduction
Functional genomics research has expanded enormously in the last decade and particularly the plant biology research community has extensively included functional genomics approaches in their recent research proposals. The number of Affymetrix plant GeneChips, for example, has doubled in the last two years [1] and extensive international genomics consortia exist for major crops (see last PAG Conference reports for an updated impression on current plant genomics, http://www.intl-pag.org). Not less importantly, many middle-sized research groups are also setting up plant EST projects and producing custom microarray platforms [2]. This massive generation of plant sequence data and rapid spread of functional genomics technologies among plant research labs has created a strong demand for bioinformatics resources adapted to vegetative species. Functional annotation of novel plant DNA sequences is probably one of the top requirements in plant functional genomics as this holds, to a great extent, the key to the biological interpretation of experimental results. Controlled vocabularies have imposed along the way as the strategy of choice for the effective annotation of the function of gene products. The use of controlled vocabularies greatly facilitates the exchange of biological knowledge and the benefit from computational resources that manage this knowledge. The gene ontology (GO, http://www.geneontology.org) [3] is probably the most extensive scheme today for the description of gene product functions but also other systems such as enzyme codes [4], KEGG pathways [5], FunCat [6], or COG [7] are widely used within molecular databases. Many bioinformatics tools and methods have been developed to assist in the assignment of functional terms to gene products (reviewed in [8]). Fewer resources, however, are available when it comes to the large-scale functional annotation of novel sequence data of nonmodel species, as would be specifically required in many plant functional genomics projects. Web-based tools for the functional annotation of new sequences include AutoFact [9], GOanna/AgBase [10], GOAnno [11], Goblet [12], GoFigure + GoDel [13], GoPET [14], Gotcha [15], HT-GO-FAT (liru.ars.usda.gov/ht-go-fat.htm), InterProScan [16], JAFA [17], OntoBlast [18], and PFP [19]. Additionally, functional annotation capabilities are usually incorporated in EST analysis pipelines. A few relevant examples are ESTExplorer, ESTIMA, ESTree. or JUICE (see [2] for a survey in EST analysis). These resources are valuable tools for the assignment of functional terms to uncharacterized sequences but usually lack high-throughput and data mining capabilities, in the first case, or provide automatic solutions without much user interactivity, in the second. In this paper, we describe the Blast2GO (B2G, www.blast2go.org) application for the functional annotation, management, and data mining of novel sequence data through the use of common controlled vocabulary schemas. The philosophy behind B2G development was the creation of an extensive, user-friendly, and research-oriented framework for large-scale function assignments. The main application domain of the tool is the functional genomics of nonmodel organisms and it is primarily intended to support research in experimental labs where bioinformatics support may not be strong. Since its release in September 2005 [20], more than 100 labs worldwide have become B2G users and the application has been referenced in over thirty peer-reviewed publications (www.blast2go.org/citations). Although B2G has a broad species application scope, the project originated in a crop genomics research environment and there is quite some accumulated experience in the use of B2G in plants, which includes maize, tobacco, citrus, Soybean, grape, or tomato. Projects range from functional assignments of ESTs [21–24] to GO term annotation of custom or commercial plant microarrays [25, 26], functional profiling studies [27–29], and functional characterization of specific plant gene families [30, 31].
In the following sections we will explain more extensively the concepts behind Blast2GO. We will describe in detail main functionalities of the application and show a use case that illustrates the applicability of B2G to plant functional genomics research.
2. Blast2GO Highlights
Four main driving concepts form the foundation of the Blast2GO software: biology orientation, high-throughput, annotation flexibility,and data-mining capability.
Biology orientation. The target users of Blast2GO are biology researchers working on functional genomics projects in labs where strong bioinformatics support is not necessarily present. Therefore, the application has been conceived to be easy to install, to have minimal setup and maintenance requirements, and to offer an intuitive user interface. B2G has been implemented as a multiplatform Java desktop application made accessible by Java Webstart technology. This solution employs the higher versatility of a locally running application while assuring automatic updates provided that an internet connection is available. This implementation has proven to work very efficiently in the fast transfer to users of new functionalities and for bug fixes. Furthermore, access to data in B2G is reinforced by graphical parameters that on one hand allow the easy identification and selection of sequences at various stages of the annotation process and, on the other hand, permit the joint visualization of annotation results and highlighting of most relevant features.
High-throughput while interactive. Blast2GO strives to be the application of choice for the annotation of novel sequences in functional genomics projects where thousands of fragments need to be characterized. In principle, B2G accepts any amount of records within the memory resources of the user's work station. Typical data files of 20 to 30 thousand sequences can be easily annotated on a 2 Giga RAM PC (larger projects may use the graphical interface free version of Blast2GO). During the annotation process, intermediate results can be accessed and modified by the user if desired.
Flexible annotation. Functional annotation in Blast2GO is based on homology transfer. Within this framework, the actual annotation procedure is configurable and permits the design of different annotation strategies. Blast2GO annotation parameters include the choice of search database, the strength and number of blast results, the extension of the query-hit match, the quality of the transferred annotations, and the inclusion of motif annotation. Vocabularies supported by B2G are gene ontology terms, enzyme codes (EC), InterPro IDs, and KEGG pathways.
Data mining on annotation results. Blast2GO is not a mere generator of functional annotations. The application includes a wide range of statistical and graphical functions for the evaluation of the annotation procedure and the final results. Especially, (relative) abundance of functional terms can be easily assessed and visualized.
The first release of B2G covered basic application functionalities: high-throughput blast against NCBI or local databases, mapping, annotation, and gene set enrichment analysis; scalar vector graphics (SVG) combined graphs and basic distributions charts. Enhanced modules for massive blast, modification of annotation intensity, curation, additional vocabularies, high-performing customizable graphs and pathway charts, data mining and sequence handling, as well as a wide array of input and output formats have been incorporated into the Blast2GO suite.
3. The Blast2GO Application
Figure 1 shows the basic components of the Blast2GO suite. Functional assignments proceed through an elaborate annotation procedure that comprises a central strategy plus refinement functions. Next, visualization and data mining engines permit exploiting the annotation results to gain functional knowledge.
3.1. The Annotation Procedure
The Blast2GO annotation procedure consists of three main steps: blast to find homologous sequences, mapping to collect GO terms associated to blast hits, and annotation to assign trustworthy information to query sequences. Once GO terms have been gathered, additional functionalities enable processing and modification of annotation results.
Blast step. The first step in B2G is to find sequences similar to a query set by blast [32]. B2G accepts nucleotide and protein sequences in FASTA format and supports the four basic blast programs (blastx, blastp, blastn, and tblastx). Homology searches can be launched against public databases such as (the) NCBI nr using a query-friendly version of blast (QBlast). This is the default option and in this case, no additional installations are needed. Alternatively, blast can be run locally against a proprietary FASTA-formatted database, which requires a working www-blast installation. The Make Filtered Blast-GO-BD function in the Tools menu allows the creation of customized databases containing only GO-annotated entries, which can be used in combination with the local blast option. Other configurable parameters at the blast step are the expectation value (e-value) threshold, the number of retrieved hits, and the minimal alignment length (hsp length) which permits the exclusion of hits with short, low e-value matches from the sources of functional terms. Annotation, however, will ultimately be based on sequence similarity levels as similarity percentages are independent of database size and more intuitive than e-values. Blast2GO parses blast results and presents the information for each sequence in table format. Query sequence descriptions are obtained by applying a language processing algorithm to hit descriptions, which extracts informative names and avoids low-content terms such as “hypothetical protein” or “expressed protein”.
Mapping step. Mapping is the process of retrieving GO terms associated to the hits obtained after a blast search. B2G performs three different mappings as follows. (1) Blast result accessions are used to retrieve gene names (symbols) making use of two mapping files provided by NCBI (geneinfo, gene2accession). Identified gene names are searched in the species-specific entries of the gene product table of the GO database. (2) Blast result GI identifiers are used to retrieve UniProt IDs making use of a mapping file from PIR (Non-redundant Reference Protein database) including PSD, UniProt, Swiss-Prot, TrEMBL, RefSeq, GenPept, and PDB. (3) Blast result accessions are searched directly in the DBXRef Table of the GO database.
Annotation step. This is the process of assigning functional terms to query sequences from the pool of GO terms gathered in the mapping step. Function assignment is based on the gene ontology vocabulary. Mapping from GO terms to enzyme codes permits the subsequent recovery of enzyme codes and KEGG pathway annotations. The B2G annotation algorithm takes into consideration the similarity between query and hit sequences, the quality of the source of GO assignments, and the structure of the GO DAG. For each query sequence and each candidate GO term, an annotation score (AS) is computed (see Figure 2). The AS is composed of two terms. The first, direct term (DT), represents the highest similarity value among the hit sequences bearing this GO term, weighted by a factor corresponding to its evidence code (EC). A GO term EC is present for every annotation in the GO database to indicate the procedure of functional assignment. ECs vary from experimental evidence, such as inferred by direct assay (IDA) to unsupervised assignments such as inferred by electronic annotation (IEA). The second term (AT) of the annotation rule introduces the possibility of abstraction into the annotation algorithm. Abstraction is defined as the annotation to a parent node when several child nodes are present in the GO candidate pool. This term multiplies the number of total GOs unified at the node by a user-defined factor or GO weight (GOw) that controls the possibility and strength of abstraction. When all ECw’s are set to 1 (no EC control) and the GOw is set to 0 (no abstraction is possible), the annotation score of a given GO term equals the highest similarity value among the blast hits annotated with that term. If the ECw is smaller than one, the DT decreases and higher query-hit similarities are required to surpass the annotation threshold. If the GOw is not equal to zero, the AT becomes contributing and the annotation of a parent node is possible if multiple child nodes coexist that do not reach the annotation cutoff. Default values of B2G annotation parameters were chosen to optimize the ratio between annotation coverage and annotation accuracy [20]. Finally, the AR selects the lowest terms per branch that exceed a user-defined threshold.
The annotation step in B2G can be further adjusted by setting additional filters to the hit sequences considered as annotation source. A lower limit can be set at the e-value parameter to ensure a minimum confidence at the level of homology. Similarly, %“hit” filter has been implemented to assure that a given percentage of the hit sequence is actually spanned by the query. This parameter is of importance to prevent potential function transfer from nonmatching sequence regions of modular proteins. Additionally, the minimal hsp length required at the blast step permits control of the length of the matching region.
3.2. Modulation of Annotation
Blast2GO includes different functionalities to complete and modify the annotations obtained through the above-defined procedure.
Additional vocabularies. Enzyme codes and KEGG pathway annotations are generated from the direct mapping of GO terms to their enzyme code equivalents. Additionally, Blast2GO offers InterPro searches directly from the B2G interface. The user, identified by his/her email address, has the possibility of selecting different databases available at the InterProEBI web server [33]. B2G launches sequence queries in batch, and recovers, parses, and uploads InterPro results. Furthermore, InterPro IDs can be mapped to GO terms and merged with blast-derived GO annotations to provide one integrated annotation result. In this process, B2G ensures that only the lowest term per branch remains in the final annotation set, removing possible parent-child relationships originating from the merging action.
Annotation fine-tuning. Blast2GO incorporates three additional functionalities for the refinement of annotation results. Firstly, the Annex function allows annotation augmentation through the Second Layer concept developed by The Norwegian University of Science and Technology (http://www.goat.no, [34]). Basically, the Second Layer database is a collection of manually curated univocal relationships between GO terms from the different GO categories that permits the inference of biological process and cellular component terms from molecular function annotations. Up to 15% of annotation increase and around 30% of GO term confirmations are obtained through the Annex dataset [20]. Secondly, annotation results can be summarized through GOSlim mapping. GOSlim consists of a subset of the gene ontology vocabulary encompassing key ontological terms and a mapping function between the full GO and the GOSlim. Different GOSlim mappings are available, adapted to specific biological domains. At present, GOSlim mappings for plant, yeast, from GOA and Tair, as well as a generic one are available from the GO through Blast2GO. Thirdly, the manual curation function means that the user has the possibility of editing annotation results and manually modifying GO terms and sequence descriptors.
3.3. Visualization and Data Mining
One aspect of the uniqueness of the Blast2GO software is the availability of a wide array of functions to monitor, evaluate, and visualize the annotation process and results. The purpose of these functions is to help understand how functional annotation proceeds and to optimize performance.
Statistical charts. Summary statistics charts are generated after each of the annotation steps. Distribution plots for e-value and similarity within blast results give an idea of the degree of homology that query sequences have in the searched database. Once mapping has been completed, the user can check the distribution of evidence codes in the recovered GO terms and the original database sources of annotations. These charts give an indication of suitable values for B2G annotation parameters. For example, when a good overall level of sequence similarity is obtained for the dataset, the default annotation cutoff value could be raised to improve annotation accuracy. Similarly, if evidence code charts indicate a low representation of experimentally derived GOs, the user might choose to increase the weight given to electronic annotations. After the final annotation step, new charts show the distribution of annotated sequences, the number of GOs per sequence, the number of sequences per GO, and the distribution of annotations per GO level, which jointly provide a general overview of the performance of the annotation procedure.
Sequence coloring. The visual approach of B2G is further represented by the color code given to annotated sequences. During the annotation process, the background color of active sequences changes according to their analysis status. Nonblasted sequences are displayed in white and change to light red once a positive blast result is obtained. If the result was negative, they will stay dark red. Mapped sequences are depicted in green while annotated sequences become blue. Finally, manually curated sequences can be labeled and colored purple (see Figure 3(A)). Sequence coloring is a simple and effective way of identifying sequences that have reached differential stages during the annotation process. Furthermore, sequences can be selected by their color. This is a very useful function for the interactive use of the application. For example, sequences that stayed dark red after blast (no positive result) can be selected to be launched to InterProScan. Sequences that remained green (mapping code) after the annotation step can be selected and reannotated with more permissive parameters.
Combined graph. A core functionality of Blast2GO is the joint visualization of groups of GO terms within the structure of the GO DAG. The combined graph function is typically used to study the collective biological meaning of a set of sequences. Combined graphs are a good alternative to enrichment analysis (see below) where no reference set is to be considered or the number of involved sequences is low. B2G includes several parameters to make these combined graphs easy to analyze and navigate. Firstly, the ZWF format [35], a powerful scalable vector graphics engine, has been adopted to make zooming and browsing through the DAG fast and light. Secondly, annotation-rich areas of the generated DAG can be readily spotted by a node-coloring function. B2G colors nodes either by the number of sequences gathered at that term (additive function) or by a node information score (exponential function, ) that considers the places of direct annotation. This B2G score takes into account the amount of sequences collected at a given term but penalizes by the distance to the node of actual annotation [20]. The B2G score has shown to be a useful parameter for the identification of “hot” terms within a specific DAG (see Figure 3(B); Conesa, unpublished). Thirdly, the extension and density of the plotted DAG can be modulated by a node filter function. When the number of sequences involved in the combined graph is large, the resulting DAG can be too big to be practical. B2G permits filtering out of low informative terms by imposing a threshold on the number of annotated sequences or B2G scores for a node to be displayed. In this case, the number of omitted nodes is given for each branch, which is an indication of the level of local compression applied to that branch.
Enrichment analysis. A typical data mining approach applied in functional genomics research is the identification of functional classes that statistically differ between two lists of terms. For example, one might want to know the functional categories that are over- or underrepresented in the set of differentially expressed genes of a microarray experiment, or it could be of interest to find which functions are distinctly represented between different libraries of an EST collection. Blast2GO has integrated the Gossip [36] package for statistical assessment of differences in GO term abundance between two sets of sequences. This package employs the Fisher's exact test and corrects for multiple testing. For this analysis, the involved sequences with their annotations must be loaded in the application. B2G returns the GO terms under- or overrepresented at a specified significance value. Results are given as a plain table and graphically as a bar chart and as a DAG with nodes colored by their significance value. Also in this case, graph pruning and summarizing functions are available.
3.4. Other Functionalities
Next to the annotation and data mining functions, Blast2GO comprises a number of additional functionalities to handle data. In this section, we briefly comment on some of them.
Import and export. B2G provides different formats for the exchange of data. Typically, B2G inputs are FASTA-formatted sequences and returns a tab-delimited file with GO annotations. Other supported output formats are GOstats and GOSpring. Furthermore, B2G also accepts blast results in xml format. This option permits skipping the first step of the B2G annotation procedure when a blast result is already present. Similarly, when accession IDs or gene symbols are known for the query sequences, these can be directly uploaded in B2G and the application will query the B2G database for their annotations. Moreover, the Main Sequence table (see Figure 3) can be saved to a file at any moment to store intermediate results. Finally, graph and enrichment analysis results are presented both graphically and as text files.
Validation. The “true path rule” defined by the gene ontology consortium for the GO DAG assures that all the terms in the pathway from a term up to the root must always be true for a given gene product. The B2G annotation validation function applies this property to annotation results by removing any parent term that has a child within the sequence annotation set. B2G always executes validation after any modification has been made to the existing annotation, for exaple, after InterPro merging, Annex augmentation, or manual curation.
Comparison of two sets of GO terms. Given two annotation results, Blast2GO can compare their implicit DAG structures. B2G computes the number of identical nodes, more general and more specific terms within the same branch, and terms located to different branches or different GO main categories. Comparison is directional; this means that the active annotation file is contrasted to a reference or external one. Each GO term is compared to all terms in the reference set and the best matching comparison result is recorded. Once a term is matched, it is removed from the query set.
3.5. Some Performance Figures
The annotation accuracy of Blast2GO has been evaluated by comparing B2G GO annotation results to the existing annotation in a set of manually annotated Arabidopsis proteins that had been previously removed from the nr database. This evaluation indicated that using B2G default parameters, nearly 70% of identical branch recovery was achievable, which is at the top end of the methods that are based on homology search [20]. More recent evaluations have shown that Blast2GO annotation behavior is consistent across species and datasets. In general, the blast step has shown to be decisive in the annotation coverage. For a great deal of sequences with a positive blast result, functional information is available in the GO database and the final annotation success is related to the length and quality of the query sequence and the strictness of annotation parameters. Typically and using default parameters, around 50–60% of annotation success is common for EST datasets and slightly higher values are obtained for full-length proteins (Table 1).
On average, between 3 and 6 GO terms are assigned per sequence at a mean GO level very close to 5. InterPro, Annex, and GOw annotation parameters significantly increase annotation intensity—around 15%—and validate annotation results. Furthermore, default annotation options tend to provide coherent results and resemble the functional assignment obtained by a human computational reviewed analysis [37].
3.6. Use Case
In this section, we present a typical use case of Blast2GO to illustrate the major application features described in the previous sections. We will address the functional annotation of the Soybean Affymetrix GeneChip. The GeneChip Soybean Genome Array targets over 37,500 Soybean transcripts (www.affymetrix.com). The array also contains transcripts for studying two pathogens important for Soybean research. Sequence data and a detailed annotation sheet for the Soybean Genome Array are provided at the Blast2GO site (http://blast2go.bioinfo.cipf.es/b2gdata/soybean).
Blast
Sequence data in FASTA format were uploaded into the
application from the menu File Open File. After selecting the Blast menu, a
dialog opens where we can indicate the parameters for the blast step. In our
case, the easiest option is to select the nr protein database and perform blast
remotely on the NCBI server through Qblast. Additional blast parameters are
kept at default values: e-value threshold of 1e-3 and a recovery of 20 hits per
sequence. These permissive values are chosen to retrieve a large amount of
information at this first time-consuming step. Annotation stringency will be decided
later in the annotation procedure. Furthermore, we set the hsp filter to 33 to
avoid hits where the length of the matching region is smaller than 100
nucleotides. After launching, blast sequences turn red as results arrive, up to
a total of 22,788. Once blast is completed, we can visualize different charts
(similarity, e-value, and species distributions, see supplementary
material available online at
http://dx.doi.org/10.1155/2008/619832)
to get an impression of the quality of the query sequences and the blast procedure.
For example, Statistics
Blast statistics
Similarity distribution chart (see Figure 4) shows that most sequences have
blast similarity values of 50–60% or higher. This information is useful for
choosing the annotation cutoff parameter at the annotation step, and suggests
that taking a value of 60 would be adequate. Furthermore, the Species distribution
chart (see Figure 5) shows a great majority of Arabidopsis sequences within the
blast hits, followed by Cotton, Medicago, Glycine, and Nicotiana.
Mapping
Mapping is a nonconfigurable option launched from
the menu Mapping Make Mapping. GO terms could be found for 21,079 sequences
(56%). Mapping charts (menu Statistics Mapping Statistics) permit the
evaluation of mapping results. The evidence code distribution chart (see Figure 6) shows an overrepresentation of electronic annotations, although other
nonautomatic codes, such as review by computational analysis (RCA), inferred by
mutant phenotype (IMP), or inferred by direct assay (IDA) are also well
represented. This suggests that an annotation strategy that promotes nonelectronic
ECs would be meaningful as it would benefit from the high-quality GO terms
without totally excluding electronic annotations. Therefore, the default EC
weights (menu Annotation Set Evidence Code Weights) that adjust
proportionally to the reliability of the source annotation will be maintained
at the annotation step.
Annotation
Taking into consideration the charts generated by
the previous steps, we have chosen an annotation configuration with an e-value
filter of 1e-6, default gradual EC weights, a GO weight of 15, and an
annotation cutoff of 60. This implies that only sequences with a blast e-value
lower than 1e-6 will be considered in the annotation formula, that the query-hit
similarity value adjusted by the EC weight of the GO term should be at least
60, and that abstraction is strongly promoted. This annotation configuration
resulted in 17,778 successfully GO annotated sequences with a total of 70,035
GO terms at a mean GO level (distance of the GO term to the ontology root term)
of 4.72. Furthermore, 6,345 enzyme codes were mapped to a total of 5,390
sequences. Once annotation has been completed, we can visualize the results at
each step of the annotation process (see Figure 7). Reannotation is possible by
selecting green or red
sequences (Tools Select Sequences by Color) and rerunning blast, mapping, and
annotation with different, more permissive parameters. In this way, we obtain a
trustworthy annotation for most sequences and behave more permissively only for
those sequences which are hard to annotate. Other charts available at the
Annotation Statistics menu show the distribution of GO levels (see Figure 8),
the length of annotated sequences, and the histogram of GO term abundance.
Annotation augmentation
Blast-based GO annotations can be increased by means
of the integrated InterProScan function available under Annotation Run InterProScan.
The user must provide his/her email address and select the motif databases of
interest. An InterProScan search against all EBI databases resulted in the
recovery of motif functional information for 11,347 sequences and a total of
8,046 GO terms. Once merged to the already existing annotation (Annotation
Add InterProScan GOs to Annotation), 1,189 additional sequences were annotated
(see Figure 9). Once Blast plus InterProScan annotations have been gathered, a
useful step is to complete implicit annotations through the Annex function
(Annotation Augment Annotation by Annex). After this step, it is recommended
to run the function to remove first-level annotations (under Annotation
menu). In our use case, the Annex
function resulted in the addition of 8,125 new GO terms and a confirmation of
3,892 annotations, which is an average contribution of the Annex function [37].
Manual curation
The manual annotation tool is a useful functionality
when information on the automatically generated annotation needs to be changed.
For example, the target of GmaAffx.69219.1.S1_at probe was found to be the
UDP-glycosyltransferase. The automatic procedure assigned GO terms metabolic
process (GO:0008152) and transferase activity, transferring hexosyl groups
(GO:0016758) to this sequence. However, as we are aware of the ER localization
of this enzyme and its involvement in protein maturation, we would like to add
this information to the existing annotation. The manual curation function is
available at the Sequence Menu which is displayed by mouse right button click
on the selected sequence. From this Menu, the blast and annotation results for
this particular sequence can be visualized. Selection of Change Annotations and
Description edits the annotation record of GmaAffx.69219.1.S1_at. We can now
type in the Annotations box the terms GO:0005783 (endoplasmic reticulum) and
GO:0006464 (protein modification process) and mark the manual annotation box.
The new annotations are then added and the sequence turns purple (manual
annotation color code).
GOSlim
As the number of sequences and different GO terms in
the Soybean array is quite large, we are interested in a simpler representation
of the functional content of the data. An appropriate option is to map annotations
into a GOSlim. At Annotation Change to GOSlim View, we can select an
appropriate GOSlim (generic_plant) for this dataset. Upon completion of
slimming sequences acquire the yellow GOSlim coding. The original annotations
are stored and can be recovered at any moment. GOSlim mapping generated a set
of 105 different annotating GO terms on 18,820 sequences with a mean GO level
of 3.41. This means around 40 times less functional diversity than in the
original annotation (4533 different terms) and an increase of almost 2 levels
of the mean annotation depth.
Combined graph
Once the slimmed annotation is obtained, we can
visualize the functional information of the Soybean Genome Array on the GO DAG.
This functionality is available under Analysis Combined Graph. At the Dialog
we must indicate the GO category to display (e.g., biological process). To
obtain a compact representation of the information, two filters can be applied.
For example, by setting the sequence filter to 20, only those nodes with at
least 20 sequence assignments will be displayed. By setting the score filter to
20, additionally, parent nodes that do not annotate more sequences than their
children terms will be omitted from the graph. Node coloring by score value
highlights the areas in the resulting DAG where sequence annotations are most
concentrated. Figure 10 shows the Combined Graph for the Molecular Function
Category. The two most intensively colored terms at the second GO level
indicate the two most abundant functional categories in the Soybean Chip:
catalytic activity and binding. Highlighting at lower levels reveals other,
most informative, highly represented functional terms, such as hydrolase
activity (level 3), kinase activity (level 4), transcription factor activity (level
3), protein binding (level 3), nucleotide binding (level 3), and transporter
activity (level 2). The reader is referred to the annotation sheet URL
(http://blast2go.bioinfo.cipf.es/b2gdata/soybean) for figure navigation.
Enrichment analysis
The enrichment analysis function in B2G executes a
statistical assessment of differences in functional classes between two groups
of sequences. To illustrate this function, we have selected all sequences in
the Soybean chip which contain the word “membrane” within their description—132 sequences—and compared their annotations to the whole chip. We go to
Analysis Enrichment Analysis Make Fisher's Exact Test and browse for a text
file containing the test set with the names of membrane sequences. As the
comparison is made against the complete microarray dataset loaded into the
application, no file needs to be selected as Reference. We uncheck the two-tail
box to perform only positive enrichment analysis. Upon completion a table with test statistical
results is presented in the Statistics tab. This table contains significant GO
terms which are ranked according to their significance. Three different
significance parameters are given for false-positive control: false discovery
rate (FDR), family-wise error rate (FWER), and single test P-value
(Fisher P-value) (see [36] for details). By taking a FDR significance
threshold of 0.05, we obtain those functionalities that are strongly
significant for membrane proteins in the Soybean Chip. These refer to processes
related to transport, protein targeting, and photosynthesis as might be
expected for a plant species. Graphical representations of these results can be
generated at Analysis Enrichment Analysis Bar Chart and Analysis
Enrichment Analysis Make Enriched
Graph. The Bar Chart shows, for each significant GO term, frequency differences
between the membrane and the whole chip datasets (see Figure 11). The Enriched
Graph shows the DAG of significant terms with a node-coloring proportional to
the significance value. This representation helps in understanding the
biological context of functional differences and to find pseudoredundancies in
the results—parent-child
relationships within significant terms—(see Figure 12).
Export results
Once different analyses have been completed the data
can be exported in many different ways. The annotation format (menu File Export Export
Annotations) is the default format for export/import in B2G and simply consists
of a tab-delimited file with two columns, one for sequences and other for
annotation IDs. Another useful export format is GeneSpring, for communication
with this interesting application, which consists of one row per sequence and
three different columns showing the descriptions of the GO terms at the three
main GO categories. Graphs can be saved in png format. Additionally, all
information contained in the Combined Graph can be generated as table
(including sequences, GO IDs, levels, and scores) and exported (Analysis
Export Graph Information).
The analysis presented in this use case took about
15 days to complete. Four days were necessary to obtain the totality of 37,500
blast results from the NCBI while twelve days were required for the
InterProScan at the EBI web server. Mapping and Annotation were ready within a
few hours and one day was necessary to collect and evaluate charts. This shows
that with the adequate tools and some training, functional annotation of a
plant genome-wide sequence collection is in reach within a couple of weeks.
4. Conclusions
Functional annotation of novel sequence data is a key requirement for the successful generation of functional genomics in biological research. The Blast2GO suite has been developed to be a useful support to these approaches, especially (but not exclusively) in nonmodel species. This bioinformatics tool is ideal for plant functional genomics research because of the following: (1) it is suitable for any species but can be also customized for specific needs, (2) it combines high throughput with interactivity and curation, and (3) it is user-friendly and requires low bioinformatics efforts to get it running. In our opinion, the major B2G strength is the combination of functional annotation and data mining on annotation results, which means that, within one tool, researchers can generate functional annotation and assess the functional meaning of their experimental results. Further developments of Blast2GO will reinforce this second aspect thought the integration of the tool with the Babelomics (www.babelomics.org, [38]) and GEPAS suites (www.gepas.org, [39]) for the statistical analysis of functional profiling data.
Acknowledgments
This work has been funded by the Spanish Ramon y Cajal Program and the National Institute of Bioinformatics (a platform of Genoma Espana).