To the editor:
Gene Expression Omnibus (GEO)1 is a public repository for gene expression data. While the amount of data in GEO has grown exponentially, the number of publications citing GEO has grown only linearly. A central difficulty in data reuse is mapping the probes in GEO datasets to established gene identifiers, because these mappings can change as annotations for the underlying sequences change2. Microarray results therefore need to be reevaluated with the latest probe annotations. Several previous efforts have reannotated microarray probe identifiers3,4, but only for a few platforms and species.
We built a fully automated system, Array Information Library Universal Navigator (AILUN), to periodically reannotate all types of microarrays in GEO by relating every probe identifier to Entrez Gene identifiers. First, we collected all gene identifiers from Entrez Gene and UniGene and built a universal gene identifier table (UGIT). We then matched each column of every GEO platform against UGIT to find the best-matching column and type of external identifier, and annotated each probe identifier with Entrez Gene identifiers (Supplementary Methods and Supplementary Fig. 1 online).
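The column-matching step described above can be illustrated with a minimal sketch. It is not AILUN's implementation: the toy UGIT, the platform columns, and the function names `best_matching_column` and `annotate` are all hypothetical, and the scoring rule (count of values found in UGIT) is an assumption made for illustration only.

```python
# Toy UGIT: identifier type -> {external identifier -> Entrez Gene ID}.
# A tiny illustrative table, not the UGIT built by AILUN.
UGIT = {
    "gene_symbol": {"TP53": 7157, "BRCA1": 672, "EGFR": 1956},
    "genbank_accession": {"NM_000546": 7157, "NM_007294": 672},
}

# Hypothetical platform table: one probe per position, several candidate columns.
PROBE_IDS = ["p1", "p2", "p3"]
PLATFORM_COLUMNS = {
    "Description": ["tumor protein", "breast cancer 1", "unknown"],
    "Symbol": ["TP53", "BRCA1", "FOO"],
    "GB_ACC": ["NM_000546", "", "NM_999999"],
}


def best_matching_column(platform_columns, ugit):
    """Return (column, identifier type, hit count) with the most UGIT matches.

    Assumption: "best" means the largest number of column values found in UGIT.
    """
    best = (None, None, 0)
    for column, values in platform_columns.items():
        for id_type, table in ugit.items():
            hits = sum(1 for value in values if value in table)
            if hits > best[2]:
                best = (column, id_type, hits)
    return best


def annotate(probe_ids, platform_columns, ugit):
    """Map each probe to an Entrez Gene ID through the best-matching column."""
    column, id_type, _ = best_matching_column(platform_columns, ugit)
    if column is None:
        return {}  # no column matched any identifier type
    table = ugit[id_type]
    return {
        probe: table[value]
        for probe, value in zip(probe_ids, platform_columns[column])
        if value in table
    }


match = best_matching_column(PLATFORM_COLUMNS, UGIT)
annotation = annotate(PROBE_IDS, PLATFORM_COLUMNS, UGIT)
print(match)
print(annotation)
```

In this toy case the `Symbol` column wins because two of its three values appear in UGIT, and probes whose value is not found are left unannotated.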
UGIT contained 75 million gene identifiers of 90 types for 3,585 species. AILUN successfully reannotated 66% of gene expression platforms, allowing reuse of 77% of samples across 79 species. The platform annotation coverage was 5 times greater than that in GEO (Table 1), and annotations were 94% identical for probes annotated by both AILUN and GEO. To validate the accuracy of annotation, we compared the annotations on Affymetrix U133A 2.0 across AILUN, GEO and NetAffx5, using Brainarray3, which is based on probe-sequence matching, as the gold standard. AILUN performed as well as NetAffx, with 97% precision and 97% recall, and outperformed GEO, which had 98% precision and 86% recall (Supplementary Tables 1–3 and Supplementary Discussion online).
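For readers unfamiliar with the metrics, the following sketch shows one common way to compute precision and recall of probe annotations against a gold standard. The exact definitions used in the study are given in the Supplementary Discussion; the pair-based definition here, the function name `precision_recall`, and the toy data are assumptions for illustration.

```python
def precision_recall(predicted, gold):
    """Compare probe -> gene annotations against a gold standard.

    Assumption: a prediction is correct when the same (probe, gene) pair
    appears in the gold standard.
    """
    predicted_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    true_positives = len(predicted_pairs & gold_pairs)
    precision = true_positives / len(predicted_pairs) if predicted_pairs else 0.0
    recall = true_positives / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Toy data: four gold-standard probes, three predictions, one of them wrong.
gold = {"p1": 7157, "p2": 672, "p3": 1956, "p4": 2064}
predicted = {"p1": 7157, "p2": 672, "p3": 999}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under this definition, a method that annotates fewer probes can keep precision high while losing recall, which is the pattern reported for GEO above.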
The server (http://ailun.stanford.edu) offers four functions to help users reannotate platforms. 'Platform annotation' adds the latest annotations to any uploaded result file. 'Cross-species mapping' maps platform annotations to other species. 'Platform comparison' compares any two platforms to find corresponding probes mapping to the same gene. 'Gene search' finds deposited platforms and samples in GEO for any list of genes.
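The idea behind 'Platform comparison' can be sketched as grouping probes by gene on each platform and keeping the genes present on both. This is an illustrative sketch only, not the server's code; the function names and the toy annotations are hypothetical.

```python
from collections import defaultdict


def probes_by_gene(annotation):
    """Invert a probe -> Entrez Gene ID mapping into gene -> list of probes."""
    grouped = defaultdict(list)
    for probe, gene in annotation.items():
        grouped[gene].append(probe)
    return grouped


def compare_platforms(annotation_a, annotation_b):
    """Return {gene: (probes on platform A, probes on platform B)} for shared genes."""
    by_gene_a = probes_by_gene(annotation_a)
    by_gene_b = probes_by_gene(annotation_b)
    shared_genes = sorted(by_gene_a.keys() & by_gene_b.keys())
    return {
        gene: (sorted(by_gene_a[gene]), sorted(by_gene_b[gene]))
        for gene in shared_genes
    }


# Toy annotations for two hypothetical platforms.
platform_a = {"a1": 7157, "a2": 7157, "a3": 672}
platform_b = {"b1": 7157, "b2": 1956}

corresponding = compare_platforms(platform_a, platform_b)
print(corresponding)
```

Because several probes can map to one gene, the result pairs lists of probes rather than single probes.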
Note: Supplementary information is available on the Nature Methods website.
References
1. Barrett, T. et al. Nucleic Acids Res. 35, D760–D765 (2007).
2. Perez-Iratxeta, C. & Andrade, M.A. BMC Bioinformatics 6, 183 (2005).
3. Dai, M. et al. Nucleic Acids Res. 33, e175 (2005).
4. Tsai, J. et al. Genome Biol. 2, Software0002 (2001).
5. Liu, G. et al. Nucleic Acids Res. 31, 82–86 (2003).
Acknowledgements
Supported by Lucile Packard Foundation for Children's Health, US National Library of Medicine (K22 LM008261), National Institute of General Medical Sciences (R01 GM079719), Howard Hughes Medical Institute, and Pharmaceutical Research and Manufacturers of America Foundation. We thank A. Skrenchuk for computer support and A. Chiang for manuscript review.
Supplementary information
Supplementary Text and Figures
Supplementary Fig. 1, Supplementary Tables 1–3, Supplementary Discussion and Supplementary Methods (PDF 308 kb)
Cite this article
Chen, R., Li, L. & Butte, A. AILUN: reannotating gene expression data automatically. Nat Methods 4, 879 (2007). https://doi.org/10.1038/nmeth1107-879