Computational solutions for omics data

Berger, Bonnie; Peng, Jian; Singh, Mona

doi:10.1038/nrg3433

Review Article
Published: 18 April 2013

Bonnie Berger, Jian Peng & Mona Singh

Nature Reviews Genetics volume 14, pages 333–346 (2013)

Key Points

The explosive growth of omics data generated in individual laboratories around the world presents new challenges that require innovative computational tools.
Modern indexing techniques for assembly, read mapping and search can solve problems in storing and accessing massive sequencing data at the terabyte scale.
Compressive genomics is one such technique that exploits the redundancy in genomic data to store sequence data in compressed form while allowing efficient and effective searches on these data without decompressing first.
The toolbox for transcriptomic data now includes sophisticated computational methods that are able to discover patterns from unstructured or semi-structured data, borrowed from the fields of data mining and machine learning.
Graph-theoretical techniques, such as network flow and random walks, can assist in the interpretation of diverse types of functional genomics data.
A range of software tools and websites implementing key algorithmic ideas is increasingly available to meet these omics data challenges; such tools are presented in this article.
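
The indexing idea in the key points can be made concrete with a toy example. One widely used index of this kind for read mapping is the FM-index, which is built on the Burrows–Wheeler transform. The Python sketch below builds such an index over a short sequence and counts exact matches by backward search. It is only an illustration under simplifying assumptions, not the implementation of any particular read mapper: the function names `build_fm_index` and `count_occurrences` are hypothetical, the suffix array is built naively, and the occurrence table is stored uncompressed, so it would not scale to real genomes.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index: first-column offsets and an occurrence table.

    Assumption for this sketch: '$' does not occur in `text` and sorts
    before every other character (true for A, C, G, T in ASCII).
    """
    text += "$"  # sentinel marking the end of the sequence
    # Naive suffix array: acceptable for a sketch, too slow for real data.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The Burrows-Wheeler transform is the character preceding each suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)

    # first[c] = number of characters in the text that sort before c.
    counts = Counter(text)
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]

    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in counts:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return first, occ, len(bwt)


def count_occurrences(pattern, index):
    """Count exact occurrences of `pattern` by backward search."""
    first, occ, n = index
    lo, hi = 0, n  # half-open range of suffix-array rows still matching
    for c in reversed(pattern):
        if c not in first:
            return 0
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return 0
    return hi - lo


index = build_fm_index("GATTACAGATTACA")
print(count_occurrences("ATTA", index))  # 2
```

Each step of the search narrows a range of rows in the sorted suffix list, so the work done per query depends on the pattern length rather than on the length of the indexed sequence. Production tools additionally compress the occurrence table and sample the suffix array to keep memory use low.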

Abstract

High-throughput experimental technologies are generating increasingly massive and complex genomic data sets. The sheer enormity and heterogeneity of these data threaten to make the arising problems computationally infeasible. Fortunately, powerful algorithmic techniques lead to software that can answer important biomedical questions in practice. In this Review, we sample the algorithmic landscape, focusing on state-of-the-art techniques, the understanding of which will aid the bench biologist in analysing omics data. We spotlight specific examples that have facilitated and enriched analyses of sequence, transcriptomic and network data sets.

**Figure 1: De Bruijn graph of DNA sequence assembly.**
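
For readers without access to the figure, the construction named in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the method of any specific assembler: the function names (`de_bruijn_graph`, `eulerian_path`, `assemble`) are hypothetical, reads are assumed to be error-free, each distinct k-mer is counted once, and an Eulerian path is assumed to exist.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Nodes are (k-1)-mers; each distinct k-mer in the reads adds one edge."""
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])  # edge: prefix -> suffix
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for node in successors:
            in_degree[node] += 1

    # Start at the node with one more outgoing than incoming edge, if any;
    # otherwise (circular case) start at an arbitrary node.
    start = next(iter(graph))
    for node in graph:
        if len(graph[node]) - in_degree.get(node, 0) == 1:
            start = node
            break

    remaining = {node: list(succ) for node, succ in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if remaining.get(node):
            stack.append(remaining[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along an Eulerian path of the de Bruijn graph."""
    path = eulerian_path(de_bruijn_graph(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


print(assemble(["ATGCG", "GCGTA", "CGTAC"], k=4))  # ATGCGTAC
```

In this toy input every (k-1)-mer is unique, so the path is unambiguous. When the underlying sequence contains repeats longer than k-1, several Eulerian paths can exist and the sketch would return only one of them.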

**Figure 2: Application to sequence search.**

**Figure 3: Integrative interactomics applications.**

References

Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).
Article PubMed PubMed Central Google Scholar
Goecks, J., Nekrutenko, A., Taylor, J. & Team, G. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, R86 (2010).
Article PubMed PubMed Central Google Scholar
Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
Article CAS PubMed PubMed Central Google Scholar
Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).
Article CAS PubMed Google Scholar
Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).
Article CAS PubMed Google Scholar
Kircher, M. & Kelso, J. High-throughput DNA sequencing — concepts and limitations. BioEssays 32, 524–536 (2010).
Article CAS PubMed Google Scholar
Kahn, S. D. On the future of genomic data. Science 331, 728–729 (2011).
Article CAS PubMed Google Scholar
Gross, M. Riding the wave of biological data. Curr. Biol. 21, R204–R206 (2011).
Article CAS PubMed Google Scholar
Huttenhower, C. & Hofmann, O. A quick guide to large-scale genomic data mining. PLoS Comput. Biol. 6, e1000779 (2010).
Article CAS PubMed PubMed Central Google Scholar
Schatz, M., Langmead, B. & Salzberg, S. Cloud computing and the DNA data race. Nature Biotech. 28, 691–693 (2010).
Article CAS Google Scholar
Stein, L. D. The case for cloud computing in genome informatics. Genome Biol. 11, 207 (2010).
Article PubMed PubMed Central Google Scholar
Tringe, S. G. & Rubin, E. M. Metagenomics: DNA sequencing of environmental samples. Nature Rev. Genet. 6, 805–814 (2005).
Article CAS PubMed Google Scholar
Gstaiger, M. & Aebersold, R. Applying mass spectrometry-based proteomics to genetics, genomics and network biology. Nature Rev. Genet. 10, 617–627 (2009).
Article CAS PubMed Google Scholar
Mardis, E. R. Next-generation DNA sequencing methods. Annu. Rev. Genom. Hum. Genet. 9, 387–402 (2008).
Article CAS Google Scholar
Metzker, M. L. Sequencing technologies — the next generation. Nature Rev. Genet. 11, 31–46 (2010).
Article CAS PubMed Google Scholar
Schatz, M. C., Delcher, A. L. & Salzberg, S. L. Assembly of large genomes using second-generation sequencing. Genome Res. 20, 1165–1173 (2010).
Article CAS PubMed PubMed Central Google Scholar
Flicek, P. & Birney, E. Sense from sequence reads: methods for alignment and assembly. Nature Methods 6, S6–S12 (2009).
Article CAS PubMed Google Scholar
Pevzner, P. A., Tang, H. & Waterman, M. S. An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98, 9748–9753 (2001). The EULER assembler introduces the de Bruijn graph and Eulerian path formulation for assembly, a paradigm used in the most popular assemblers.
Article CAS PubMed PubMed Central Google Scholar
Batzoglou, S. et al. ARACHNE: a whole-genome shotgun assembler. Genome Res. 12, 177–189 (2002).
Article PubMed PubMed Central Google Scholar
Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).
Article CAS PubMed PubMed Central Google Scholar
Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).
Article CAS PubMed PubMed Central Google Scholar
Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010).
Article CAS PubMed PubMed Central Google Scholar
Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18, 810–820 (2008).
Article CAS PubMed PubMed Central Google Scholar
Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009).
Article CAS PubMed PubMed Central Google Scholar
Compeau, P. E., Pevzner, P. A. & Tesler, G. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 29, 987–991 (2011).
Article CAS Google Scholar
Simpson, J. T. & Durbin, R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012).
Article CAS PubMed PubMed Central Google Scholar
Earl, D. et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21, 2224–2241 (2011).
Article CAS PubMed PubMed Central Google Scholar
Vezzi, F., Narzisi, G. & Mishra, B. Reevaluating assembly rvaluations with feature response curves: GAGE and Assemblathons. PLoS ONE 7, e52210 (2012).
Article CAS PubMed PubMed Central Google Scholar
Salzberg, S. L. et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012).
Article CAS PubMed PubMed Central Google Scholar
Kingsford, C., Schatz, M. C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010). This paper analyses complexity issues in genome assembly; the primary algorithmic challenge is that assembly can be complicated by short reads and genomic repeats.
Article CAS PubMed PubMed Central Google Scholar
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357–359 (2012).
Article CAS PubMed PubMed Central Google Scholar
Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009). Bowtie is probably the most widely used FM-index- or BWT-based short-read mapper. It demonstrates that the read-mapping problem can be done accurately even on a personal computer.
Article CAS PubMed PubMed Central Google Scholar
Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).
Article CAS PubMed PubMed Central Google Scholar
Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).
Article CAS PubMed PubMed Central Google Scholar
Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009).
Article CAS PubMed Google Scholar
Ferragina, P. & Manzini, G. Indexing compressed text. JACM 52, 552–581 (2005).
Article Google Scholar
Burrows, M. & Wheeler, D. J. A block-sorting lossless data compression algorithm (Digital Equipment Corporation, 1994).
Google Scholar
Hach, F. et al. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature Methods 7, 576–577 (2010).
Article CAS PubMed PubMed Central Google Scholar
Hsi-Yang Fritz, M., Leinonen, R., Cochrane, G. & Birney, E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21, 734–740 (2011).
Article CAS PubMed PubMed Central Google Scholar
Christley, S., Lu, Y., Li, C. & Xie, X. Human genomes as e-mail attachments. Bioinformatics 25, 274–275 (2009).
Article CAS PubMed Google Scholar
Pinho, A. J., Pratas, D. & Garcia, S. P. GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res. 40, e27 (2012).
Article CAS PubMed Google Scholar
Tembe, W., Lowey, J. & Suh, E. G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 26, 2192–2194 (2010).
Article CAS PubMed Google Scholar
Brandon, M. C., Wallace, D. C. & Baldi, P. Data structures and compression algorithms for genomic sequence data. Bioinformatics 25, 1731–1738 (2009).
Article CAS PubMed PubMed Central Google Scholar
Wang, C. & Zhang, D. A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Res. 39, e45 (2011).
Article CAS PubMed PubMed Central Google Scholar
Loh, P. R., Baym, M. & Berger, B. Compressive genomics. Nature Biotech. 30, 627–630 (2012). This paper introduces 'compressive genomics', a general algorithmic paradigm that harnesses redundancy within data sets to speed up analyses by compressing data in such a way as to allow direct computation on the compressed data. Compressed versions of BLAST and BLAT demonstrate search times that scale linearly in the amount of non-redundant data without loss of accuracy.
Article CAS Google Scholar
Deorowicz, S. & Grabowski, S. Compression of DNA sequence reads in FASTQ format. Bioinformatics 27, 860–862 (2011).
Article CAS PubMed Google Scholar
Hach, F., Numanagic, I., Alkan, C. & Sahinalp, S. C. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057 (2012).
Article CAS PubMed PubMed Central Google Scholar
Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
Article CAS PubMed Google Scholar
Kent, W. J. BLAT—the BLAST-Like Alignment Tool. Genome Res. 12, 656–664 (2002).
Article CAS PubMed PubMed Central Google Scholar
Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).
Article CAS PubMed PubMed Central Google Scholar
Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods 5, 621–628 (2008).
Article CAS PubMed Google Scholar
Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517 (2008).
Article CAS PubMed PubMed Central Google Scholar
Ozsolak, F. et al. Direct RNA sequencing. Nature 461, 814–818 (2009).
Article CAS PubMed Google Scholar
Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotech. 31, 46–53 (2012).
Article CAS Google Scholar
Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nature Biotech. 29, 644–652 (2011).
Article CAS Google Scholar
Garber, M., Grabherr, M. G., Guttman, M. & Trapnell, C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature Methods 8, 469–477 (2011).
Article CAS PubMed Google Scholar
Brown, P. O. & Botstein, D. Exploring the new world of the genome with DNA microarrays. Nature Genet. 21, 33–37 (1999).
Article CAS PubMed Google Scholar
Barrett, T. et al. NCBI GEO: archive for functional genomics data sets-update. Nucleic Acids Res. 41, D991–D995 (2013).
Article CAS PubMed Google Scholar
Butte, A. The use and analysis of microarray data. Nature Rev. Drug Discov. 1, 951–960 (2002).
Article CAS Google Scholar
Allison, D. B., Cui, X., Page, G. P. & Sabripour, M. Microarray data analysis: from disarray to consolidation and consensus. Nature Rev. Genet. 7, 55–65 (2006).
Article CAS PubMed Google Scholar
Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Rev. Genet. 11, 733–739 (2010).
Article CAS PubMed Google Scholar
Shen-Orr, S. S. et al. Cell type-specific gene expression differences in complex tissues. Nature Methods 7, 287–289 (2010). This work describes a linear algebraic approach to model the mixture of gene expression signals of multiple cell types from microarray experiments and to deconvolute the signals separately for each cell type.
Article CAS PubMed PubMed Central Google Scholar
Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protoc. 7, 562–578 (2012).
Article CAS Google Scholar
Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 25, 1105–1111 (2009).
Article CAS PubMed PubMed Central Google Scholar
Kim, D. & Salzberg, S. L. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 12, R72 (2011).
Article CAS PubMed PubMed Central Google Scholar
Whitney, A. R. et al. Individuality and variation in gene expression patterns in human blood. Proc. Natl Acad. Sci. USA 100, 1896–1901 (2003).
Article CAS PubMed PubMed Central Google Scholar
Lu, P., Nakorchevskiy, A. & Marcotte, E. M. Expression deconvolution: a reinterpretation of DNA microarray data reveals dynamic changes in cell populations. Proc. Natl Acad. Sci. USA 100, 10370–10375 (2003).
Article CAS PubMed PubMed Central Google Scholar
Wang, Y. et al. In silico estimates of tissue components in surgical samples based on expression profiling data. Cancer Res. 70, 6448–6455 (2010).
Article CAS PubMed PubMed Central Google Scholar
Gaujoux, R. & Seoighe, C. Semi-supervised nonnegative matrix factorization for gene expression deconvolution: a case study. Infect. Genet. Evol. 12, 913–921 (2012).
Article CAS PubMed Google Scholar
Clarke, J., Seo, P. & Clarke, B. Statistical expression deconvolution from mixed tissue samples. Bioinformatics 26, 1043–1049 (2010).
Article CAS PubMed PubMed Central Google Scholar
Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
Article CAS PubMed PubMed Central Google Scholar
Reich, M. et al. GenePattern 2.0. Nature Genet. 38, 500–501 (2006).
Article CAS PubMed Google Scholar
Tanay, A., Sharan, R., Kupiec, M. & Shamir, R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc. Natl Acad. Sci. USA 101, 2981–2986 (2004).
Article CAS PubMed PubMed Central Google Scholar
Narayanan, M., Vetta, A., Schadt, E. E. & Zhu, J. Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS Comput. Biol. 6, e1000742 (2010).
Article CAS PubMed PubMed Central Google Scholar
Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).
Article CAS PubMed PubMed Central Google Scholar
Segal, E. et al. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genet. 34, 166–176 (2003). A probabilistic graphical model is constructed to identify regulatory modules, consisting of co-regulated or co-expressed genes, from gene expression data.
Article PubMed Google Scholar
Kim, D., Kim, M. S. & Cho, K. H. The core regulation module of stress-responsive regulatory networks in yeast. Nucleic Acids Res. 40, 8793–8802 (2012).
Article CAS PubMed PubMed Central Google Scholar
Zinman, G. E., Zhong, S. & Bar-Joseph, Z. Biological interaction networks are conserved at the module level. BMC Syst. Biol. 5, 134 (2011).
Article PubMed PubMed Central Google Scholar
Rhrissorrakrai, K. & Gunsalus, K. C. MINE: Module Identification in Networks. BMC Bioinformatics 12, 192 (2011).
Article PubMed PubMed Central Google Scholar
Colak, R. et al. Module discovery by exhaustive search for densely connected, co-expressed regions in biomolecular interaction networks. PLoS ONE 5, e13348 (2010).
Article CAS PubMed PubMed Central Google Scholar
Ali, W. & Deane, C. M. Functionally guided alignment of protein interaction networks for module detection. Bioinformatics 25, 3166–3173 (2009).
Article CAS PubMed PubMed Central Google Scholar
Zhang, Y., Xuan, J., de los Reyes, B. G., Clarke, R. & Ressom, H. W. Reverse engineering module networks by PSO-RNN hybrid modeling. BMC Genomics 10 (Suppl. 1), S15 (2009).
Article CAS PubMed PubMed Central Google Scholar
Michoel, T., De Smet, R., Joshi, A., Van de Peer, Y. & Marchal, K. Comparative analysis of module-based versus direct methods for reverse-engineering transcriptional regulatory networks. BMC Syst. Biol. 3, 49 (2009).
Article CAS PubMed PubMed Central Google Scholar
Joshi, A., De Smet, R., Marchal, K., Van de Peer, Y. & Michoel, T. Module networks revisited: computational assessment and prioritization of model predictions. Bioinformatics 25, 490–496 (2009).
Article CAS PubMed Google Scholar
Wang, X., Dalkic, E., Wu, M. & Chan, C. Gene module level analysis: identification to networks and dynamics. Curr. Opin. Biotechnol. 19, 482–491 (2008).
Article CAS PubMed PubMed Central Google Scholar
Hirose, O. et al. Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models. Bioinformatics 24, 932–942 (2008).
Article CAS PubMed Google Scholar
Litvin, O., Causton, H. C., Chen, B. J. & Pe'er, D. Modularity and interactions in the genetics of gene expression. Proc. Natl Acad. Sci. USA 106, 6441–6446 (2009).
Article PubMed PubMed Central Google Scholar
Akavia, U. D. et al. An integrated approach to uncover drivers of cancer. Cell 143, 1005–1017 (2010). The computational approach CONEXIC implements a module network to integrate different data sets, including CNVs and gene expression, from cancer studies and discover dysregulated genes.
Article CAS PubMed PubMed Central Google Scholar
Maathuis, M. H., Colombo, D., Kalisch, M. & Buhlmann, P. Predicting causal effects in large-scale systems from observational data. Nature Methods 7, 247–248 (2010). This paper describes an algorithm to estimate the effects of perturbations from observational data in gene expression experiments in which the causal relationship is not known between genes.
Article CAS PubMed Google Scholar
Markowetz, F., Kostka, D., Troyanskaya, O. G. & Spang, R. Nested effects models for high-dimensional phenotyping screens. Bioinformatics 23, I305–I312 (2007).
Article CAS PubMed Google Scholar
Prat, Y., Fromer, M., Linial, N. & Linial, M. Recovering key biological constituents through sparse representation of gene expression. Bioinformatics 27, 655–661 (2011).
Article CAS PubMed Google Scholar
Yeung, K. Y. & Ruzzo, W. L. Principal component analysis for clustering gene expression data. Bioinformatics 17, 763–774 (2001).
Article CAS PubMed Google Scholar
Schmid, M. et al. A gene expression map of Arabidopsis thaliana development. Nature Genet. 37, 501–506 (2005). Scalable methods are introduced here that associate expression patterns to phenotypes both to label new expression samples with and to identify marker genes for phenotypes.
Article CAS PubMed Google Scholar
Zhou, X., Kao, M. C. & Wong, W. H. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA 99, 12783–12788 (2002).
Article CAS PubMed PubMed Central Google Scholar
Parts, L., Stegle, O., Winn, J. & Durbin, R. Joint genetic analysis of gene expression data with inferred cellular phenotypes. PLoS Genet. 7, e1001276 (2011).
Article CAS PubMed PubMed Central Google Scholar
Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nature Protoc. 7, 500–507 (2012).
Article CAS Google Scholar
Ng, S. et al. PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics 28, i640–i646 (2012).
Article CAS PubMed PubMed Central Google Scholar
Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237–i245 (2010).
Article CAS PubMed PubMed Central Google Scholar
The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).
The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).
Heiser, L. M. et al. Subtype and pathway specific responses to anticancer compounds in breast cancer. Proc. Natl Acad. Sci. USA 109, 2724–2729 (2012).
Article PubMed Google Scholar
Liu, X., Yu, X., Zack, D. J., Zhu, H. & Qian, J. TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinformatics 9, 271 (2008).
Article CAS PubMed PubMed Central Google Scholar
Ogasawara, O. et al. BodyMap-Xs: anatomical breakdown of 17 million animal ESTs for cross-species comparison of gene expression. Nucleic Acids Res. 34, D628–D631 (2006).
Article CAS PubMed Google Scholar
Sirota, M. et al. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci. Transl. Med. 3, 96ra77 (2011).
Article CAS PubMed PubMed Central Google Scholar
Lamb, J. The Connectivity Map: a new tool for biomedical research. Nature Rev. Cancer 7, 54–60 (2007).
Article Google Scholar
Schmid, P. R., Palmer, N. P., Kohane, I. S. & Berger, B. Making sense out of massive data by going beyond differential expression. Proc. Natl Acad. Sci. USA 109, 5594–5599 (2012).
Article PubMed PubMed Central Google Scholar
Palmer, N. P., Schmid, P. R., Berger, B. & Kohane, I. S. A gene expression profile of stem cell pluripotentiality and differentiation is conserved across diverse solid and hematopoietic cancers. Genome Biol. 13, R71 (2012).
Article PubMed PubMed Central Google Scholar
Dudley, J. T., Tibshirani, R., Deshpande, T. & Butte, A. J. Disease signatures are robust across tissues and experiments. Mol. Syst. Biol. 5, 307 (2009).
Article PubMed PubMed Central Google Scholar
Li, W. et al. Integrative analysis of many weighted co-expression networks using tensor computation. PLoS Comput. Biol. 7, e1001106 (2011).
Article CAS PubMed PubMed Central Google Scholar
Franceschini, A. et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–D815 (2013).
Article CAS PubMed Google Scholar
Croft, D. et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 39, D691–D697 (2011).
Article CAS PubMed Google Scholar
Chatr-aryamontri, A. et al. The BioGRID interaction database: 2013 update. Nucleic Acids Res. 41, D816–D823 (2013).
Article CAS PubMed Google Scholar
Gerstein, M. B. et al. Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91–100 (2012).
Article CAS PubMed PubMed Central Google Scholar
Wong, A. K. et al. IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res. 40, W484–W490 (2012).
Article CAS PubMed PubMed Central Google Scholar
Hartwell, L. H., Hopfield, J. J., Leibler, S. & Murray, A. W. From molecular to modular cell biology. Nature 402, C47–C52 (1999).
Article CAS PubMed Google Scholar
Ideker, T., Ozier, O., Schwikowski, B. & Siegel, A. F. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18 (Suppl. 1), S233–240 (2002).
Article PubMed Google Scholar
Ulitsky, I. & Shamir, R. Identification of functional modules using network topology and high-throughput data. BMC Syst. Biol. 1, 8 (2007). This study uncovers modules in interaction networks such that the components within a module are also similar to each other with respect to expression or another attribute of interest.
Article CAS PubMed PubMed Central Google Scholar
Jiang, P. & Singh, M. SPICi: a fast clustering algorithm for large biological networks. Bioinformatics 26, 1105–1111 (2010).
Article CAS PubMed PubMed Central Google Scholar
Nabieva, E., Jim, K., Agarwal, A., Chazelle, B. & Singh, M. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21 (Suppl. 1), i302–i310 (2005). Network flow-based methods are introduced as a paradigm for propagating information within cellular networks.
Article CAS PubMed Google Scholar
Singh, R. & Berger, B. Influence flow: integrating pathway-specific RNAi data and protein interaction data. International Society for Computational Biology [online], (2007).
Yeger-Lotem, E. et al. Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synuclein toxicity. Nature Genet. 41, 316–323 (2009).
Article CAS PubMed Google Scholar
Lan, A. et al. ResponseNet: revealing signaling and regulatory networks linking genetic and transcriptomic screening data. Nucleic Acids Res. 39, W424–W429 (2011).
Article CAS PubMed PubMed Central Google Scholar
Huang, S. S. & Fraenkel, E. Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks. Sci. Signal. 2, ra40 (2009). This paper introduces a Steiner tree formulation to uncover subnetworks connecting a set of seed proteins.
PubMed PubMed Central Google Scholar
Tuncbag, N., McCallum, S., Huang, S. S. & Fraenkel, E. SteinerNet: a web server for integrating 'omic' data to discover hidden components of response pathways. Nucleic Acids Res. 40, W505–W509 (2012).
Article CAS PubMed PubMed Central Google Scholar
Yeang, C. H., Ideker, T. & Jaakkola, T. Physical network models. J. Comput. Biol. 11, 243–262 (2004).
Article CAS PubMed Google Scholar
Tu, Z., Wang, L., Arbeitman, M. N., Chen, T. & Sun, F. An integrative approach for causal gene identification and gene regulatory pathway inference. Bioinformatics 22, e489–e496 (2006).
Article CAS PubMed Google Scholar
Suthram, S. Beyer, A., Karp, R. M., Eldar, Y. & Ideker, T. eQED: an efficient method for interpreting eQTL associations using protein networks. Mol. Syst. Biol. 4, 162 (2008).
Article PubMed PubMed Central Google Scholar
Kim, Y. A., Wuchty, S. & Przytycka, T. M. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Comput. Biol. 7, e1001095 (2011).
Article CAS PubMed PubMed Central Google Scholar
Doyle, P. G. & Snell, J. L. Random Walks and Electric Networks (Mathematical Association of America, 1984).
Google Scholar
Steffen, M., Petti, A., Aach, J., D'Haeseleer, P. & Church, G. Automated modelling of signal transduction networks. BMC Bioinformatics 3, 34 (2002).
Article PubMed PubMed Central Google Scholar
Pandey, J. et al. Functional annotation of regulatory pathways. Bioinformatics 23, i377–i386 (2007).
Article CAS PubMed Google Scholar
Banks, E., Nabieva, E., Chazelle, B. & Singh, M. Organization of physical interactomes as uncovered by network schemas. PLoS Comput. Biol. 4, e1000203 (2008).
Article CAS PubMed PubMed Central Google Scholar
Banks, E., Nabieva, E., Peterson, R. & Singh, M. NetGrep: fast network schema searches in interactomes. Genome Biol. 9, R138 (2008).
Article CAS PubMed PubMed Central Google Scholar
Singh, R., Xu, J. & Berger, B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl Acad. Sci. USA 105, 12763–12768 (2008). This paper introduces global network alignment and pioneers the use of spectral methods to solve it. Led to IsoBase, a database of functionally related proteins across protein-protein, genetic interaction and metabolic networks, simultaneously incorporating both sequence and network data.
Article PubMed PubMed Central Google Scholar
Liao, C. S., Lu, K., Baym, M., Singh, R. & Berger, B. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 25, i253–i258 (2009).
Article CAS PubMed PubMed Central Google Scholar
Flannick, J., Novak, A. Srinivasan, B. S., McAdams, H. H. & Batzoglou, S. Graemlin: general and robust alignment of multiple large interaction networks. Genome Res. 16, 1169–1181 (2006).
Article CAS PubMed PubMed Central Google Scholar
Koyuturk, M. et al. Pairwise alignment of protein interaction networks. J. Comput. Biol. 13, 182–199 (2006).
Article PubMed Google Scholar
Kelley, B. P. et al. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl Acad. Sci. USA 100, 11394–11399 (2003).
Article CAS PubMed PubMed Central Google Scholar
Atias, N. & Sharan, R. Comparative analysis of protein networks: hard problems, practical solutions. Commun. Acm 55, 88–97 (2012).
Article Google Scholar
Park, D., Singh, R., Baym, M., Liao, C. S. & Berger, B. IsoBase: a database of functionally related proteins across PPI networks. Nucleic Acids Res. 39, D295–D300 (2011).
Article CAS PubMed Google Scholar
Ma, C.-Y. et al. Reconstruction of phyletic trees by global alignment of multiple metabolic networks. BMC Bioinformatics (in the press).
Goh, K. I. et al. The human disease network. Proc. Natl Acad. Sci. USA 104, 8685–8690 (2007).
Article CAS PubMed PubMed Central Google Scholar
Rossin, E. J. et al. Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology. PLoS Genet. 7, e1001273 (2011).
Article CAS PubMed PubMed Central Google Scholar
Navlakha, S. & Kingsford, C. The power of protein interaction networks for associating genes with diseases. Bioinformatics 26, 1057–1063 (2010).
Article CAS PubMed PubMed Central Google Scholar
Vanunu, O., Magger, O., Ruppin, E., Shlomi, T. & Sharan, R. Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol. 6, e1000641 (2010).
Kohler, S., Bauer, S., Horn, D. & Robinson, P. N. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008). This paper introduces random-walk based approaches for prioritizing disease genes using interaction networks.
Erten, S., Bebek, G., Ewing, R. M. & Koyuturk, M. DADA: degree-aware algorithms for network-based disease gene prioritization. BioData Min. 4, 19 (2011).
Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for detecting significantly mutated pathways in cancer. J. Comput. Biol. 18, 507–522 (2011). The authors develop a flow-based and statistical approach for analysing genes mutated in cancers within their network context in order to identify significantly mutated subnetworks.
Cerami, E., Demir, E., Schultz, N., Taylor, B. S. & Sander, C. Automated network analysis identifies core pathways in glioblastoma. PLoS ONE 5, e8918 (2010).
Kumar, P., Henikoff, S. & Ng, P. C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature Protoc. 4, 1073–1081 (2009).
Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nature Methods 7, 248–249 (2010).
Yandell, M. et al. A probabilistic disease-gene finder for personal genomes. Genome Res. 21, 1529–1542 (2011).
Vandin, F., Upfal, E. & Raphael, B. J. De novo discovery of mutated driver pathways in cancer. Genome Res. 22, 375–385 (2012).
Chowdhury, S. A. & Koyuturk, M. Identification of coordinately dysregulated subnetworks in complex phenotypes. Pac. Symp. Biocomput. 2010, 133–144 (2010).
Ulitsky, I., Krishnamurthy, A., Karp, R. M. & Shamir, R. DEGAS: de novo discovery of dysregulated pathways in human diseases. PLoS ONE 5, e13367 (2010).
Cho, D.-Y., Kim, Y.-A. & Przytycka, T. M. Network biology approach to complex diseases. PLoS Comput. Biol. (in the press).
Wang, Z., Gerstein, M. & Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nature Rev. Genet. 10, 57–63 (2009).
Furey, T. S. ChIP-seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nature Rev. Genet. 13, 840–852 (2012).
Hafner, M., Lianoglou, S., Tuschl, T. & Betel, D. Genome-wide identification of miRNA targets by PAR-CLIP. Methods 58, 94–105 (2012).
Wang, E. T. et al. Transcriptome-wide regulation of pre-mRNA splicing and mRNA localization by muscleblind proteins. Cell 150, 710–724 (2012).
Ascano, M., Hafner, M., Cekan, P., Gerstberger, S. & Tuschl, T. Identification of RNA-protein interaction networks using PAR-CLIP. Wiley Interdiscip. Rev. RNA 3, 159–177 (2012).
Jungkamp, A. C. et al. In vivo and transcriptome-wide identification of RNA binding protein target sites. Mol. Cell 44, 828–840 (2011).
Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).
Meyer, L. R. et al. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res. 41, D64–D69 (2013).
de Souza, N. The ENCODE project. Nature Methods 9, 1046 (2012).
Gerstein, M. B. et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE Project. Science 330, 1775–1787 (2010).
Manber, U. & Myers, G. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 935–948 (1993).
Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713–714 (2008).
Paten, B. et al. Cactus: algorithms for genome multiple sequence alignment. Genome Res. 21, 1512–1528 (2011).

Acknowledgements

The authors thank and L. Cowen for valuable feedback. B.B. thanks the US National Institutes of Health (NIH) for grant GM081871. M.S. thanks the NIH for grant GM076275 and the US National Science Foundation (NSF) for grant ABI0850063.

Author information

Bonnie Berger, Jian Peng and Mona Singh: All authors contributed equally to this work.

Authors and Affiliations

Department of Mathematics and Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, 02139, Massachusetts, USA
Bonnie Berger
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, 02139, Massachusetts, USA
Bonnie Berger & Jian Peng
Department of Computer Science and the Lewis–Sigler Institute for Integrative Genomics, Princeton University, Princeton, 08542, New Jersey, USA
Mona Singh

Authors

Bonnie Berger
Jian Peng
Mona Singh

Corresponding author

Correspondence to Bonnie Berger.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Cloud computing: The use of computing resources distributed over the Internet to store, manage and analyse data, rather than doing so on a local server or personal computer.
Parallel computing: A form of computation that allows numerous calculations to be carried out simultaneously, thereby accelerating computation. On the basis of this principle, many large-scale computational tasks can then be divided into smaller ones and solved on multiple machines concurrently.
Machine learning techniques: Empirical data are taken as input, the relationship among the data is mathematically or statistically modelled, and patterns or predictions are generated. Supervised learning algorithms infer a function from labelled data features and predict labels on future input; unsupervised learning algorithms model the patterns or the distribution of a given unlabelled data set.
Parallel dynamic programming: A technique that splits a large dynamic programming problem (one that is usually solved by filling a table so that redundant calculations are avoided) into a number of subproblems and computes these subproblems in parallel using multiple central processing units (CPUs). The computing speed-up scales almost linearly with the number of CPUs.
Multicore central processing units: (Multicore CPUs). Single computing processors with two or more independent computing units (called cores). Running multiple instructions on multiple cores at the same time can increase the overall speed of programs.
Cache-oblivious algorithm: Takes advantage of the cache system of the central processing unit (that is, the local memory of frequently accessed data) to avoid expensive memory access operations and thus to improve efficiency; the intrinsic design of these algorithms does not require computer programs to be tuned for machines with different cache systems.
Linear mixed model: A statistical model that models the observed effects from multiple different hidden factors; the effects are additively mixed according to the proportions of their corresponding factors.
Matrix factorization: A method for decomposing a matrix into the product of two matrices. It can be applied to identify individual factors involved in a mixed observation.
Differential geometry: A mathematical discipline for studying geometric objects, such as curves and surfaces, using the techniques of differential and integral calculus.
Linear programming: A mathematical program for the optimization of a linear objective function, subject to linear constraints. Such functions capture the linear relationship between variables for the problem being optimized.
Principal component analysis: A tool for transforming a set of observations with correlated variables into a set of linearly uncorrelated variables called principal components, such that the first principal component accounts for the largest variability of the data.
Copy number variant: (CNV). Corresponds to an abnormal number of copies of one or more segments in the genome. CNVs can be caused by structural rearrangements of the genome such as deletions, duplications, inversions and translocations.
Bayesian network: A statistical model that describes the distribution of a set of random variables by a directed acyclic graph that represents the relationship among the random variables. For example, in a Bayesian network for a regulatory relationship for a set of genes, each variable represents a gene and each directed edge denotes either activating or repressing regulation between two genes.
Steiner tree problem: Formulated on a network to find a minimum-length subnetwork that interconnects a set of seed nodes. Any two seed nodes may be connected by an edge or a path through other nodes.
Random walk: A mathematical formulation of a number of successive random steps on a graph. It has been widely used to explain stochastic observations, such as diffusion in biological networks.
Eigenvalue problem: Given a square matrix, the aim is to find a non-zero vector (that is, an eigenvector) such that multiplying the matrix by the vector yields the same vector scaled by a scalar factor (that is, the eigenvalue).
Set cover: Given a set of elements and a collection of subsets of those elements, the goal is to find the minimum number of subsets whose union covers all the elements.
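
The random walk entry above can be made concrete with a small sketch. The Python code below is a minimal illustration of a random walk with restart on a toy undirected network, computed by repeated propagation until the visiting probabilities stop changing. It is not the implementation used in any of the cited papers. The function name random_walk_with_restart, the restart probability of 0.3, the convergence tolerance and the five-node network are all assumptions chosen for illustration.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Approximate steady-state visiting probabilities of a walk that restarts at the seeds.

    adjacency: dict mapping each node to a list of its neighbours (hypothetical input format).
    seeds: nodes where the walker starts and to which it jumps back on a restart;
           every seed is assumed to be a key of `adjacency`.
    restart: probability of jumping back to a seed at each step (assumed value).
    """
    nodes = sorted(adjacency)
    seeds = set(seeds)
    # The restart distribution is uniform over the seed nodes.
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(max_iter):
        # With probability `restart` the walker returns to a seed ...
        nxt = {n: restart * start[n] for n in nodes}
        # ... otherwise it moves to a uniformly chosen neighbour.
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # An isolated node keeps its own probability mass.
                nxt[n] += (1.0 - restart) * prob[n]
                continue
            share = (1.0 - restart) * prob[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


# Toy network: a path A-B-C-D with an extra node E attached to B.
toy_network = {
    "A": ["B"],
    "B": ["A", "C", "E"],
    "C": ["B", "D"],
    "D": ["C"],
    "E": ["B"],
}
scores = random_walk_with_restart(toy_network, seeds={"A"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this sketch, nodes that are closer to the seed in the network receive higher scores, which is the intuition behind using such walks to rank candidate nodes by their proximity to a set of known ones. The fixed point that the loop converges to can also be viewed as the solution of a linear system involving the network's transition matrix, which links this entry to the eigenvalue problem entry.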

About this article

Cite this article

Berger, B., Peng, J. & Singh, M. Computational solutions for omics data. Nat Rev Genet 14, 333–346 (2013). https://doi.org/10.1038/nrg3433

Published: 18 April 2013
Issue Date: May 2013
DOI: https://doi.org/10.1038/nrg3433

This article is cited by

DNA-framework-based multidimensional molecular classifiers for cancer diagnosis
- Fangfei Yin
- Haipei Zhao
- Chunhai Fan
Nature Nanotechnology (2023)
Detecting protein complexes with multiple properties by an adaptive harmony search algorithm
- Rongquan Wang
- Caixia Wang
- Huimin Ma
BMC Bioinformatics (2022)
A novel liver cancer diagnosis method based on patient similarity network and DenseGCN
- Ge Zhang
- Zhen Peng
- Huimin Luo
Scientific Reports (2022)
Precision medicine: the precision gap in rheumatic disease
- Chung M. A. Lin
- Faye A. H. Cooles
- John D. Isaacs
Nature Reviews Rheumatology (2022)
Transcription Factor Activation Profiles (TFAP) identify compounds promoting differentiation of Acute Myeloid Leukemia cell lines
- Federica Riccio
- Elisa Micarelli
- Gianni Cesareni
Cell Death Discovery (2022)