
  • Review Article

Computational solutions for omics data

Key Points

  • The explosive growth of omics data generated in individual laboratories around the world presents new challenges that require innovative computational tools.

  • Modern indexing techniques for assembly, read mapping and search can solve problems in storing and accessing massive sequencing data at the terabyte scale.

  • Compressive genomics is one such technique that exploits the redundancy in genomic data to store sequence data in compressed form while allowing efficient and effective searches on these data without decompressing first.

  • The toolbox for transcriptomic data now includes sophisticated computational methods, borrowed from the fields of data mining and machine learning, that can discover patterns in unstructured or semi-structured data.

  • Graph-theoretical techniques, such as network flow and random walks, can assist in the interpretation of diverse types of functional genomics data.

  • A range of software tools and websites implementing key algorithmic ideas is increasingly available to meet these omics data challenges; this article presents such tools.
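
As a rough illustration of the compressive genomics point above, the following Python sketch searches a collection by scanning only its non-redundant content. It is a toy under stated assumptions, not the method of any tool the article presents: redundancy is modelled as exact duplicates only, plain substring matching stands in for alignment, and the names `build_compressed_db` and `search` are hypothetical.

```python
from collections import defaultdict


def build_compressed_db(sequences):
    """Collapse identical sequences (assumption: redundancy = exact duplicates).

    Returns a dict mapping each unique sequence to the ids that share it.
    """
    unique_to_ids = defaultdict(list)
    for seq_id, seq in sequences.items():
        unique_to_ids[seq].append(seq_id)
    return dict(unique_to_ids)


def search(compressed_db, query):
    """Scan each unique sequence once, then expand hits to every id it represents.

    Substring matching is a stand-in for a real alignment step.
    """
    hits = []
    for seq, ids in compressed_db.items():
        if query in seq:
            hits.extend(ids)
    return sorted(hits)


# Toy collection: three of the four entries are identical.
sequences = {
    "s1": "ACGTACGTGG",
    "s2": "ACGTACGTGG",
    "s3": "TTGACCA",
    "s4": "ACGTACGTGG",
}
db = build_compressed_db(sequences)
result = search(db, "CGTG")
print(len(db), result)
```

In this toy, the search loop visits two unique sequences rather than four stored ones, so the work grows with the amount of non-redundant data while every original entry is still reported.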

Abstract

High-throughput experimental technologies are generating increasingly massive and complex genomic data sets. The sheer enormity and heterogeneity of these data threaten to make the arising problems computationally infeasible. Fortunately, powerful algorithmic techniques lead to software that can answer important biomedical questions in practice. In this Review, we sample the algorithmic landscape, focusing on state-of-the-art techniques, the understanding of which will aid the bench biologist in analysing omics data. We spotlight specific examples that have facilitated and enriched analyses of sequence, transcriptomic and network data sets.

Figure 1: De Bruijn graph of DNA sequence assembly.
Figure 2: Application to sequence search.
Figure 3: Integrative interactomics applications.
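
The figures themselves are not reproduced here. As a hedged stand-in for the construction named in the Figure 1 caption, the Python sketch below builds a small de Bruijn graph from reads and spells a sequence from an Eulerian path through it. It does not reproduce the figure. It assumes error-free reads, ignores k-mer multiplicity, and assumes that an Eulerian path exists. The function names and the example reads are hypothetical.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build adjacency lists: each distinct k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in seen:  # assumption: each k-mer is used once
                continue
            seen.add(kmer)
            graph[kmer[:-1]].append(kmer[1:])
    return dict(graph)


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    out_edges = {node: list(targets) for node, targets in graph.items()}
    in_degree = defaultdict(int)
    for targets in graph.values():
        for target in targets:
            in_degree[target] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(iter(graph))
    for node in graph:
        if len(graph[node]) - in_degree[node] == 1:
            start = node
            break
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges.get(node):
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def spell(path):
    """Join overlapping (k-1)-mers along the path into one sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGCG", "GCGTA", "CGTAC"]  # hypothetical overlapping reads
graph = de_bruijn_graph(reads, k=3)
assembled = spell(eulerian_path(graph))
print(assembled)
```

The example reads were chosen so that no (k-1)-mer repeats, which makes the path unique. When a (k-1)-mer does repeat, several Eulerian paths can exist and the spelled sequence is no longer determined by the graph alone.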

References

  1. Gentleman, R. C. et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5, R80 (2004).

    Article  PubMed  PubMed Central  Google Scholar 

  2. Goecks, J., Nekrutenko, A., Taylor, J. & Team, G. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol. 11, R86 (2010).

    Article  PubMed  PubMed Central  Google Scholar 

  3. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  4. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

    Article  CAS  PubMed  Google Scholar 

  5. Venter, J. C. et al. The sequence of the human genome. Science 291, 1304–1351 (2001).

    Article  CAS  PubMed  Google Scholar 

  6. Kircher, M. & Kelso, J. High-throughput DNA sequencing — concepts and limitations. BioEssays 32, 524–536 (2010).

    Article  CAS  PubMed  Google Scholar 

  7. Kahn, S. D. On the future of genomic data. Science 331, 728–729 (2011).

    Article  CAS  PubMed  Google Scholar 

  8. Gross, M. Riding the wave of biological data. Curr. Biol. 21, R204–R206 (2011).

    Article  CAS  PubMed  Google Scholar 

  9. Huttenhower, C. & Hofmann, O. A quick guide to large-scale genomic data mining. PLoS Comput. Biol. 6, e1000779 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  10. Schatz, M., Langmead, B. & Salzberg, S. Cloud computing and the DNA data race. Nature Biotech. 28, 691–693 (2010).

    Article  CAS  Google Scholar 

  11. Stein, L. D. The case for cloud computing in genome informatics. Genome Biol. 11, 207 (2010).

    Article  PubMed  PubMed Central  Google Scholar 

  12. Tringe, S. G. & Rubin, E. M. Metagenomics: DNA sequencing of environmental samples. Nature Rev. Genet. 6, 805–814 (2005).

    Article  CAS  PubMed  Google Scholar 

  13. Gstaiger, M. & Aebersold, R. Applying mass spectrometry-based proteomics to genetics, genomics and network biology. Nature Rev. Genet. 10, 617–627 (2009).

    Article  CAS  PubMed  Google Scholar 

  14. Mardis, E. R. Next-generation DNA sequencing methods. Annu. Rev. Genom. Hum. Genet. 9, 387–402 (2008).

    Article  CAS  Google Scholar 

  15. Metzker, M. L. Sequencing technologies — the next generation. Nature Rev. Genet. 11, 31–46 (2010).

    Article  CAS  PubMed  Google Scholar 

  16. Schatz, M. C., Delcher, A. L. & Salzberg, S. L. Assembly of large genomes using second-generation sequencing. Genome Res. 20, 1165–1173 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  17. Flicek, P. & Birney, E. Sense from sequence reads: methods for alignment and assembly. Nature Methods 6, S6–S12 (2009).

    Article  CAS  PubMed  Google Scholar 

  18. Pevzner, P. A., Tang, H. & Waterman, M. S. An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA 98, 9748–9753 (2001). The EULER assembler introduces the de Bruijn graph and Eulerian path formulation for assembly, a paradigm used in the most popular assemblers.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  19. Batzoglou, S. et al. ARACHNE: a whole-genome shotgun assembler. Genome Res. 12, 177–189 (2002).

    Article  PubMed  PubMed Central  Google Scholar 

  20. Jaffe, D. B. et al. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13, 91–96 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  21. Zerbino, D. R. & Birney, E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 18, 821–829 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  22. Li, R. et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 20, 265–272 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  23. Butler, J. et al. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 18, 810–820 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  24. Simpson, J. T. et al. ABySS: a parallel assembler for short read sequence data. Genome Res. 19, 1117–1123 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  25. Compeau, P. E., Pevzner, P. A. & Tesler, G. How to apply de Bruijn graphs to genome assembly. Nature Biotech. 29, 987–991 (2011).

    Article  CAS  Google Scholar 

  26. Simpson, J. T. & Durbin, R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 22, 549–556 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  27. Earl, D. et al. Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 21, 2224–2241 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  28. Vezzi, F., Narzisi, G. & Mishra, B. Reevaluating assembly rvaluations with feature response curves: GAGE and Assemblathons. PLoS ONE 7, e52210 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Salzberg, S. L. et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 22, 557–567 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Kingsford, C., Schatz, M. C. & Pop, M. Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 11, 21 (2010). This paper analyses complexity issues in genome assembly; the primary algorithmic challenge is that assembly can be complicated by short reads and genomic repeats.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature Methods 9, 357–359 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  32. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009). Bowtie is probably the most widely used FM-index- or BWT-based short-read mapper. It demonstrates that the read-mapping problem can be done accurately even on a personal computer.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Li, H. & Durbin, R. Fast and accurate long-read alignment with Burrows–Wheeler transform. Bioinformatics 26, 589–595 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  35. Li, R. et al. SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 25, 1966–1967 (2009).

    Article  CAS  PubMed  Google Scholar 

  36. Ferragina, P. & Manzini, G. Indexing compressed text. JACM 52, 552–581 (2005).

    Article  Google Scholar 

  37. Burrows, M. & Wheeler, D. J. A block-sorting lossless data compression algorithm (Digital Equipment Corporation, 1994).

    Google Scholar 

  38. Hach, F. et al. mrsFAST: a cache-oblivious algorithm for short-read mapping. Nature Methods 7, 576–577 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  39. Hsi-Yang Fritz, M., Leinonen, R., Cochrane, G. & Birney, E. Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Res. 21, 734–740 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  40. Christley, S., Lu, Y., Li, C. & Xie, X. Human genomes as e-mail attachments. Bioinformatics 25, 274–275 (2009).

    Article  CAS  PubMed  Google Scholar 

  41. Pinho, A. J., Pratas, D. & Garcia, S. P. GReEn: a tool for efficient compression of genome resequencing data. Nucleic Acids Res. 40, e27 (2012).

    Article  CAS  PubMed  Google Scholar 

  42. Tembe, W., Lowey, J. & Suh, E. G-SQZ: compact encoding of genomic sequence and quality data. Bioinformatics 26, 2192–2194 (2010).

    Article  CAS  PubMed  Google Scholar 

  43. Brandon, M. C., Wallace, D. C. & Baldi, P. Data structures and compression algorithms for genomic sequence data. Bioinformatics 25, 1731–1738 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  44. Wang, C. & Zhang, D. A novel compression tool for efficient storage of genome resequencing data. Nucleic Acids Res. 39, e45 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  45. Loh, P. R., Baym, M. & Berger, B. Compressive genomics. Nature Biotech. 30, 627–630 (2012). This paper introduces 'compressive genomics', a general algorithmic paradigm that harnesses redundancy within data sets to speed up analyses by compressing data in such a way as to allow direct computation on the compressed data. Compressed versions of BLAST and BLAT demonstrate search times that scale linearly in the amount of non-redundant data without loss of accuracy.

    Article  CAS  Google Scholar 

  46. Deorowicz, S. & Grabowski, S. Compression of DNA sequence reads in FASTQ format. Bioinformatics 27, 860–862 (2011).

    Article  CAS  PubMed  Google Scholar 

  47. Hach, F., Numanagic, I., Alkan, C. & Sahinalp, S. C. SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28, 3051–3057 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  48. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

    Article  CAS  PubMed  Google Scholar 

  49. Kent, W. J. BLAT—the BLAST-Like Alignment Tool. Genome Res. 12, 656–664 (2002).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  50. Levy, S. et al. The diploid genome sequence of an individual human. PLoS Biol. 5, e254 (2007).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  51. Mortazavi, A., Williams, B. A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods 5, 621–628 (2008).

    Article  CAS  PubMed  Google Scholar 

  52. Marioni, J. C., Mason, C. E., Mane, S. M., Stephens, M. & Gilad, Y. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Res. 18, 1509–1517 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  53. Ozsolak, F. et al. Direct RNA sequencing. Nature 461, 814–818 (2009).

    Article  CAS  PubMed  Google Scholar 

  54. Trapnell, C. et al. Differential analysis of gene regulation at transcript resolution with RNA-seq. Nature Biotech. 31, 46–53 (2012).

    Article  CAS  Google Scholar 

  55. Grabherr, M. G. et al. Full-length transcriptome assembly from RNA-seq data without a reference genome. Nature Biotech. 29, 644–652 (2011).

    Article  CAS  Google Scholar 

  56. Garber, M., Grabherr, M. G., Guttman, M. & Trapnell, C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nature Methods 8, 469–477 (2011).

    Article  CAS  PubMed  Google Scholar 

  57. Brown, P. O. & Botstein, D. Exploring the new world of the genome with DNA microarrays. Nature Genet. 21, 33–37 (1999).

    Article  CAS  PubMed  Google Scholar 

  58. Barrett, T. et al. NCBI GEO: archive for functional genomics data sets-update. Nucleic Acids Res. 41, D991–D995 (2013).

    Article  CAS  PubMed  Google Scholar 

  59. Butte, A. The use and analysis of microarray data. Nature Rev. Drug Discov. 1, 951–960 (2002).

    Article  CAS  Google Scholar 

  60. Allison, D. B., Cui, X., Page, G. P. & Sabripour, M. Microarray data analysis: from disarray to consolidation and consensus. Nature Rev. Genet. 7, 55–65 (2006).

    Article  CAS  PubMed  Google Scholar 

  61. Leek, J. T. et al. Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Rev. Genet. 11, 733–739 (2010).

    Article  CAS  PubMed  Google Scholar 

  62. Shen-Orr, S. S. et al. Cell type-specific gene expression differences in complex tissues. Nature Methods 7, 287–289 (2010). This work describes a linear algebraic approach to model the mixture of gene expression signals of multiple cell types from microarray experiments and to deconvolute the signals separately for each cell type.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  63. Trapnell, C. et al. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protoc. 7, 562–578 (2012).

    Article  CAS  Google Scholar 

  64. Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-seq. Bioinformatics 25, 1105–1111 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  65. Kim, D. & Salzberg, S. L. TopHat-Fusion: an algorithm for discovery of novel fusion transcripts. Genome Biol. 12, R72 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  66. Whitney, A. R. et al. Individuality and variation in gene expression patterns in human blood. Proc. Natl Acad. Sci. USA 100, 1896–1901 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  67. Lu, P., Nakorchevskiy, A. & Marcotte, E. M. Expression deconvolution: a reinterpretation of DNA microarray data reveals dynamic changes in cell populations. Proc. Natl Acad. Sci. USA 100, 10370–10375 (2003).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  68. Wang, Y. et al. In silico estimates of tissue components in surgical samples based on expression profiling data. Cancer Res. 70, 6448–6455 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  69. Gaujoux, R. & Seoighe, C. Semi-supervised nonnegative matrix factorization for gene expression deconvolution: a case study. Infect. Genet. Evol. 12, 913–921 (2012).

    Article  CAS  PubMed  Google Scholar 

  70. Clarke, J., Seo, P. & Clarke, B. Statistical expression deconvolution from mixed tissue samples. Bioinformatics 26, 1043–1049 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  71. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  72. Reich, M. et al. GenePattern 2.0. Nature Genet. 38, 500–501 (2006).

    Article  CAS  PubMed  Google Scholar 

  73. Tanay, A., Sharan, R., Kupiec, M. & Shamir, R. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc. Natl Acad. Sci. USA 101, 2981–2986 (2004).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  74. Narayanan, M., Vetta, A., Schadt, E. E. & Zhu, J. Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS Comput. Biol. 6, e1000742 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  75. Anders, S. & Huber, W. Differential expression analysis for sequence count data. Genome Biol. 11, R106 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  76. Segal, E. et al. Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genet. 34, 166–176 (2003). A probabilistic graphical model is constructed to identify regulatory modules, consisting of co-regulated or co-expressed genes, from gene expression data.

    Article  PubMed  Google Scholar 

  77. Kim, D., Kim, M. S. & Cho, K. H. The core regulation module of stress-responsive regulatory networks in yeast. Nucleic Acids Res. 40, 8793–8802 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  78. Zinman, G. E., Zhong, S. & Bar-Joseph, Z. Biological interaction networks are conserved at the module level. BMC Syst. Biol. 5, 134 (2011).

    Article  PubMed  PubMed Central  Google Scholar 

  79. Rhrissorrakrai, K. & Gunsalus, K. C. MINE: Module Identification in Networks. BMC Bioinformatics 12, 192 (2011).

    Article  PubMed  PubMed Central  Google Scholar 

  80. Colak, R. et al. Module discovery by exhaustive search for densely connected, co-expressed regions in biomolecular interaction networks. PLoS ONE 5, e13348 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  81. Ali, W. & Deane, C. M. Functionally guided alignment of protein interaction networks for module detection. Bioinformatics 25, 3166–3173 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  82. Zhang, Y., Xuan, J., de los Reyes, B. G., Clarke, R. & Ressom, H. W. Reverse engineering module networks by PSO-RNN hybrid modeling. BMC Genomics 10 (Suppl. 1), S15 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  83. Michoel, T., De Smet, R., Joshi, A., Van de Peer, Y. & Marchal, K. Comparative analysis of module-based versus direct methods for reverse-engineering transcriptional regulatory networks. BMC Syst. Biol. 3, 49 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  84. Joshi, A., De Smet, R., Marchal, K., Van de Peer, Y. & Michoel, T. Module networks revisited: computational assessment and prioritization of model predictions. Bioinformatics 25, 490–496 (2009).

    Article  CAS  PubMed  Google Scholar 

  85. Wang, X., Dalkic, E., Wu, M. & Chan, C. Gene module level analysis: identification to networks and dynamics. Curr. Opin. Biotechnol. 19, 482–491 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  86. Hirose, O. et al. Statistical inference of transcriptional module-based gene networks from time course gene expression profiles by using state space models. Bioinformatics 24, 932–942 (2008).

    Article  CAS  PubMed  Google Scholar 

  87. Litvin, O., Causton, H. C., Chen, B. J. & Pe'er, D. Modularity and interactions in the genetics of gene expression. Proc. Natl Acad. Sci. USA 106, 6441–6446 (2009).

    Article  PubMed  PubMed Central  Google Scholar 

  88. Akavia, U. D. et al. An integrated approach to uncover drivers of cancer. Cell 143, 1005–1017 (2010). The computational approach CONEXIC implements a module network to integrate different data sets, including CNVs and gene expression, from cancer studies and discover dysregulated genes.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  89. Maathuis, M. H., Colombo, D., Kalisch, M. & Buhlmann, P. Predicting causal effects in large-scale systems from observational data. Nature Methods 7, 247–248 (2010). This paper describes an algorithm to estimate the effects of perturbations from observational data in gene expression experiments in which the causal relationship is not known between genes.

    Article  CAS  PubMed  Google Scholar 

  90. Markowetz, F., Kostka, D., Troyanskaya, O. G. & Spang, R. Nested effects models for high-dimensional phenotyping screens. Bioinformatics 23, I305–I312 (2007).

    Article  CAS  PubMed  Google Scholar 

  91. Prat, Y., Fromer, M., Linial, N. & Linial, M. Recovering key biological constituents through sparse representation of gene expression. Bioinformatics 27, 655–661 (2011).

    Article  CAS  PubMed  Google Scholar 

  92. Yeung, K. Y. & Ruzzo, W. L. Principal component analysis for clustering gene expression data. Bioinformatics 17, 763–774 (2001).

    Article  CAS  PubMed  Google Scholar 

  93. Schmid, M. et al. A gene expression map of Arabidopsis thaliana development. Nature Genet. 37, 501–506 (2005). Scalable methods are introduced here that associate expression patterns to phenotypes both to label new expression samples with and to identify marker genes for phenotypes.

    Article  CAS  PubMed  Google Scholar 

  94. Zhou, X., Kao, M. C. & Wong, W. H. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA 99, 12783–12788 (2002).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  95. Parts, L., Stegle, O., Winn, J. & Durbin, R. Joint genetic analysis of gene expression data with inferred cellular phenotypes. PLoS Genet. 7, e1001276 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  96. Stegle, O., Parts, L., Piipari, M., Winn, J. & Durbin, R. Using probabilistic estimation of expression residuals (PEER) to obtain increased power and interpretability of gene expression analyses. Nature Protoc. 7, 500–507 (2012).

    Article  CAS  Google Scholar 

  97. Ng, S. et al. PARADIGM-SHIFT predicts the function of mutations in multiple cancers using pathway impact analysis. Bioinformatics 28, i640–i646 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  98. Vaske, C. J. et al. Inference of patient-specific pathway activities from multi-dimensional cancer genomics data using PARADIGM. Bioinformatics 26, i237–i245 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  99. The Cancer Genome Atlas Network. Comprehensive molecular characterization of human colon and rectal cancer. Nature 487, 330–337 (2012).

  100. The Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature 490, 61–70 (2012).

  101. Heiser, L. M. et al. Subtype and pathway specific responses to anticancer compounds in breast cancer. Proc. Natl Acad. Sci. USA 109, 2724–2729 (2012).

    Article  PubMed  Google Scholar 

  102. Liu, X., Yu, X., Zack, D. J., Zhu, H. & Qian, J. TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinformatics 9, 271 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  103. Ogasawara, O. et al. BodyMap-Xs: anatomical breakdown of 17 million animal ESTs for cross-species comparison of gene expression. Nucleic Acids Res. 34, D628–D631 (2006).

    Article  CAS  PubMed  Google Scholar 

  104. Sirota, M. et al. Discovery and preclinical validation of drug indications using compendia of public gene expression data. Sci. Transl. Med. 3, 96ra77 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  105. Lamb, J. The Connectivity Map: a new tool for biomedical research. Nature Rev. Cancer 7, 54–60 (2007).

    Article  Google Scholar 

  106. Schmid, P. R., Palmer, N. P., Kohane, I. S. & Berger, B. Making sense out of massive data by going beyond differential expression. Proc. Natl Acad. Sci. USA 109, 5594–5599 (2012).

    Article  PubMed  PubMed Central  Google Scholar 

  107. Palmer, N. P., Schmid, P. R., Berger, B. & Kohane, I. S. A gene expression profile of stem cell pluripotentiality and differentiation is conserved across diverse solid and hematopoietic cancers. Genome Biol. 13, R71 (2012).

    Article  PubMed  PubMed Central  Google Scholar 

  108. Dudley, J. T., Tibshirani, R., Deshpande, T. & Butte, A. J. Disease signatures are robust across tissues and experiments. Mol. Syst. Biol. 5, 307 (2009).

    Article  PubMed  PubMed Central  Google Scholar 

  109. Li, W. et al. Integrative analysis of many weighted co-expression networks using tensor computation. PLoS Comput. Biol. 7, e1001106 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  110. Franceschini, A. et al. STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 41, D808–D815 (2013).

    Article  CAS  PubMed  Google Scholar 

  111. Croft, D. et al. Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 39, D691–D697 (2011).

    Article  CAS  PubMed  Google Scholar 

  112. Chatr-aryamontri, A. et al. The BioGRID interaction database: 2013 update. Nucleic Acids Res. 41, D816–D823 (2013).

    Article  CAS  PubMed  Google Scholar 

  113. Gerstein, M. B. et al. Architecture of the human regulatory network derived from ENCODE data. Nature 489, 91–100 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  114. Wong, A. K. et al. IMP: a multi-species functional genomics portal for integration, visualization and prediction of protein functions and networks. Nucleic Acids Res. 40, W484–W490 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  115. Hartwell, L. H., Hopfield, J. J., Leibler, S. & Murray, A. W. From molecular to modular cell biology. Nature 402, C47–C52 (1999).

    Article  CAS  PubMed  Google Scholar 

  116. Ideker, T., Ozier, O., Schwikowski, B. & Siegel, A. F. Discovering regulatory and signalling circuits in molecular interaction networks. Bioinformatics 18 (Suppl. 1), S233–240 (2002).

    Article  PubMed  Google Scholar 

  117. Ulitsky, I. & Shamir, R. Identification of functional modules using network topology and high-throughput data. BMC Syst. Biol. 1, 8 (2007). This study uncovers modules in interaction networks such that the components within a module are also similar to each other with respect to expression or another attribute of interest.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  118. Jiang, P. & Singh, M. SPICi: a fast clustering algorithm for large biological networks. Bioinformatics 26, 1105–1111 (2010).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  119. Nabieva, E., Jim, K., Agarwal, A., Chazelle, B. & Singh, M. Whole-proteome prediction of protein function via graph-theoretic analysis of interaction maps. Bioinformatics 21 (Suppl. 1), i302–i310 (2005). Network flow-based methods are introduced as a paradigm for propagating information within cellular networks.

    Article  CAS  PubMed  Google Scholar 

  120. Singh, R. & Berger, B. Influence flow: integrating pathway-specific RNAi data and protein interaction data. International Society for Computational Biology [online], (2007).

  121. Yeger-Lotem, E. et al. Bridging high-throughput genetic and transcriptional data reveals cellular responses to alpha-synuclein toxicity. Nature Genet. 41, 316–323 (2009).

    Article  CAS  PubMed  Google Scholar 

  122. Lan, A. et al. ResponseNet: revealing signaling and regulatory networks linking genetic and transcriptomic screening data. Nucleic Acids Res. 39, W424–W429 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  123. Huang, S. S. & Fraenkel, E. Integrating proteomic, transcriptional, and interactome data reveals hidden components of signaling and regulatory networks. Sci. Signal. 2, ra40 (2009). This paper introduces a Steiner tree formulation to uncover subnetworks connecting a set of seed proteins.

    PubMed  PubMed Central  Google Scholar 

  124. Tuncbag, N., McCallum, S., Huang, S. S. & Fraenkel, E. SteinerNet: a web server for integrating 'omic' data to discover hidden components of response pathways. Nucleic Acids Res. 40, W505–W509 (2012).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  125. Yeang, C. H., Ideker, T. & Jaakkola, T. Physical network models. J. Comput. Biol. 11, 243–262 (2004).

    Article  CAS  PubMed  Google Scholar 

  126. Tu, Z., Wang, L., Arbeitman, M. N., Chen, T. & Sun, F. An integrative approach for causal gene identification and gene regulatory pathway inference. Bioinformatics 22, e489–e496 (2006).

    Article  CAS  PubMed  Google Scholar 

  127. Suthram, S. Beyer, A., Karp, R. M., Eldar, Y. & Ideker, T. eQED: an efficient method for interpreting eQTL associations using protein networks. Mol. Syst. Biol. 4, 162 (2008).

    Article  PubMed  PubMed Central  Google Scholar 

  128. Kim, Y. A., Wuchty, S. & Przytycka, T. M. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Comput. Biol. 7, e1001095 (2011).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  129. Doyle, P. G. & Snell, J. L. Random Walks and Electric Networks (Mathematical Association of America, 1984).

    Google Scholar 

  130. Steffen, M., Petti, A., Aach, J., D'Haeseleer, P. & Church, G. Automated modelling of signal transduction networks. BMC Bioinformatics 3, 34 (2002).

    Article  PubMed  PubMed Central  Google Scholar 

  131. Pandey, J. et al. Functional annotation of regulatory pathways. Bioinformatics 23, i377–i386 (2007).

    Article  CAS  PubMed  Google Scholar 

  132. Banks, E., Nabieva, E., Chazelle, B. & Singh, M. Organization of physical interactomes as uncovered by network schemas. PLoS Comput. Biol. 4, e1000203 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  133. Banks, E., Nabieva, E., Peterson, R. & Singh, M. NetGrep: fast network schema searches in interactomes. Genome Biol. 9, R138 (2008).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  134. Singh, R., Xu, J. & Berger, B. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl Acad. Sci. USA 105, 12763–12768 (2008). This paper introduces global network alignment and pioneers the use of spectral methods to solve it. Led to IsoBase, a database of functionally related proteins across protein-protein, genetic interaction and metabolic networks, simultaneously incorporating both sequence and network data.

    Article  PubMed  PubMed Central  Google Scholar 

  135. Liao, C. S., Lu, K., Baym, M., Singh, R. & Berger, B. IsoRankN: spectral methods for global alignment of multiple protein networks. Bioinformatics 25, i253–i258 (2009).

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  136. Flannick, J., Novak, A. Srinivasan, B. S., McAdams, H. H. & Batzoglou, S. Graemlin: general and robust alignment of multiple large interaction networks. Genome Res. 16, 1169–1181 (2006).

  137. Koyuturk, M. et al. Pairwise alignment of protein interaction networks. J. Comput. Biol. 13, 182–199 (2006).

  138. Kelley, B. P. et al. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl Acad. Sci. USA 100, 11394–11399 (2003).

  139. Atias, N. & Sharan, R. Comparative analysis of protein networks: hard problems, practical solutions. Commun. ACM 55, 88–97 (2012).

  140. Park, D., Singh, R., Baym, M., Liao, C. S. & Berger, B. IsoBase: a database of functionally related proteins across PPI networks. Nucleic Acids Res. 39, D295–D300 (2011).

  141. Ma, C.-Y. et al. Reconstruction of phyletic trees by global alignment of multiple metabolic networks. BMC Bioinformatics (in the press).

  142. Goh, K. I. et al. The human disease network. Proc. Natl Acad. Sci. USA 104, 8685–8690 (2007).

  143. Rossin, E. J. et al. Proteins encoded in genomic regions associated with immune-mediated disease physically interact and suggest underlying biology. PLoS Genet. 7, e1001273 (2011).

  144. Navlakha, S. & Kingsford, C. The power of protein interaction networks for associating genes with diseases. Bioinformatics 26, 1057–1063 (2010).

  145. Vanunu, O., Magger, O., Ruppin, E., Shlomi, T. & Sharan, R. Associating genes and protein complexes with disease via network propagation. PLoS Comput. Biol. 6, e1000641 (2010).

  146. Kohler, S., Bauer, S., Horn, D. & Robinson, P. N. Walking the interactome for prioritization of candidate disease genes. Am. J. Hum. Genet. 82, 949–958 (2008). This paper introduces random-walk based approaches for prioritizing disease genes using interaction networks.

  147. Erten, S., Bebek, G., Ewing, R. M. & Koyuturk, M. DADA: degree-aware algorithms for network-based disease gene prioritization. BioData Min. 4, 19 (2011).

  148. Vandin, F., Upfal, E. & Raphael, B. J. Algorithms for detecting significantly mutated pathways in cancer. J. Comput. Biol. 18, 507–522 (2011). The authors develop a flow-based and statistical approach for analysing genes mutated in cancers within their network context in order to identify significantly mutated subnetworks.

  149. Cerami, E., Demir, E., Schultz, N., Taylor, B. S. & Sander, C. Automated network analysis identifies core pathways in glioblastoma. PLoS ONE 5, e8918 (2010).

  150. Kumar, P., Henikoff, S. & Ng, P. C. Predicting the effects of coding non-synonymous variants on protein function using the SIFT algorithm. Nature Protoc. 4, 1073–1081 (2009).

  151. Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nature Methods 7, 248–249 (2010).

  152. Yandell, M. et al. A probabilistic disease-gene finder for personal genomes. Genome Res. 21, 1529–1542 (2011).

  153. Vandin, F., Upfal, E. & Raphael, B. J. De novo discovery of mutated driver pathways in cancer. Genome Res. 22, 375–385 (2012).

  154. Chowdhury, S. A. & Koyuturk, M. Identification of coordinately dysregulated subnetworks in complex phenotypes. Pac. Symp. Biocomput. 2010, 133–144 (2010).

  155. Ulitsky, I., Krishnamurthy, A., Karp, R. M. & Shamir, R. DEGAS: de novo discovery of dysregulated pathways in human diseases. PLoS ONE 5, e13367 (2010).

  156. Cho, D.-Y., Kim, Y.-A. & Przytycka, T. M. Network biology approach to complex diseases. PLoS Comput. Biol. (in the press).

  157. Wang, Z., Gerstein, M. & Snyder, M. RNA-seq: a revolutionary tool for transcriptomics. Nature Rev. Genet. 10, 57–63 (2009).

  158. Furey, T. S. ChIP-seq and beyond: new and improved methodologies to detect and characterize protein–DNA interactions. Nature Rev. Genet. 13, 840–852 (2012).

  159. Hafner, M., Lianoglou, S., Tuschl, T. & Betel, D. Genome-wide identification of miRNA targets by PAR-CLIP. Methods 58, 94–105 (2012).

  160. Wang, E. T. et al. Transcriptome-wide regulation of pre-mRNA splicing and mRNA localization by muscleblind proteins. Cell 150, 710–724 (2012).

  161. Ascano, M., Hafner, M., Cekan, P., Gerstberger, S. & Tuschl, T. Identification of RNA-protein interaction networks using PAR-CLIP. Wiley Interdiscip. Rev. RNA 3, 159–177 (2012).

  162. Jungkamp, A. C. et al. In vivo and transcriptome-wide identification of RNA binding protein target sites. Mol. Cell 44, 828–840 (2011).

  163. Ingolia, N. T., Ghaemmaghami, S., Newman, J. R. & Weissman, J. S. Genome-wide analysis in vivo of translation with nucleotide resolution using ribosome profiling. Science 324, 218–223 (2009).

  164. Meyer, L. R. et al. The UCSC Genome Browser database: extensions and updates 2013. Nucleic Acids Res. 41, D64–D69 (2013).

  165. de Souza, N. The ENCODE project. Nature Methods 9, 1046–1046 (2012).

  166. Gerstein, M. B. et al. Integrative analysis of the Caenorhabditis elegans genome by the modENCODE Project. Science 330, 1775–1787 (2010).

  167. Manber, U. & Myers, G. Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22, 935–948 (1993).

  168. Li, R., Li, Y., Kristiansen, K. & Wang, J. SOAP: short oligonucleotide alignment program. Bioinformatics 24, 713–714 (2008).

  169. Paten, B. et al. Cactus: algorithms for genome multiple sequence alignment. Genome Res. 21, 1512–1528 (2011).

Acknowledgements

The authors thank and L. Cowen for valuable feedback. B.B. thanks the US National Institutes of Health (NIH) for grant GM081871. M.S. thanks the NIH for grant GM076275 and US National Science Foundation (NSF) for grant ABI0850063.

Author information

Corresponding author

Correspondence to Bonnie Berger.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Glossary

Cloud computing

The use of computing resources distributed across the Internet to store, manage and analyse data, rather than doing so on a local server or personal computer.

Parallel computing

A form of computation that allows numerous calculations to be carried out simultaneously, thereby accelerating computation. On the basis of this principle, many large-scale computational tasks can then be divided into smaller ones and solved on multiple machines concurrently.

Machine learning techniques

Techniques that take empirical data as input, model the relationships among the data mathematically or statistically, and generate patterns or predictions. Supervised learning algorithms infer a function from labelled data features and predict labels on future input; unsupervised learning algorithms model the patterns or the distribution of a given unlabelled data set.

Parallel dynamic programming

A technique that splits a large dynamic programming problem (which is usually solved by filling a table so that redundant calculation is avoided) into a number of subproblems and computes these subproblems in parallel using multiple central processing units (CPUs). The computing speed-up scales almost linearly with the number of CPUs.

Multicore central processing units

(Multicore CPUs). Single computing processors with two or more independent computing units (called cores). Running multiple instructions on multiple cores at the same time can increase the overall speed of programs.

Cache-oblivious algorithm

Takes advantage of the cache system of the central processing unit (that is, the local memory of frequently accessed data) to avoid expensive memory access operations and thus to improve efficiency; the intrinsic design of these algorithms does not require computer programs to be tuned for machines with different cache systems.

Linear mixed model

A statistical model that models the observed effects from multiple different hidden factors; the effects are additively mixed according to the proportions of their corresponding factors.

Matrix factorization

A method for decomposing a matrix into the product of two matrices. It can be applied to identify individual factors involved in a mixed observation.

Differential geometry

A mathematical discipline for studying geometric objects, such as curves and surfaces, using the techniques of differential and integral calculus.

Linear programming

A mathematical program for the optimization of a linear objective function, subject to linear constraints. Such functions capture the linear relationship between variables for the problem being optimized.
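
As an illustration of this definition, one common way to write a linear program is shown below. The symbols are notation introduced here and do not appear in the original glossary: $x$ is the vector of decision variables, $c$ holds the objective coefficients, and $A$ and $b$ encode the linear constraints.

```latex
\[
\max_{x \in \mathbb{R}^{n}} \; c^{\top} x
\quad \text{subject to} \quad
A x \le b, \qquad x \ge 0 .
\]
```

Minimization problems and equality constraints can be rewritten in this form, so the choice of a maximization with inequality constraints is a convention rather than a restriction.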

Principal component analysis

A tool for transforming a set of observations with correlated variables into a set of linearly uncorrelated variables called principal components, such that the first principal component accounts for the largest variability of the data.

Copy number variant

(CNV). Corresponds to an abnormal number of copies of one or more segments of the genome. CNVs can be caused by structural rearrangements of the genome such as deletions, duplications, inversions and translocations.

Bayesian network

A statistical model that describes the distribution of a set of random variables by a directed acyclic graph that represents the relationship among the random variables. For example, in a Bayesian network for a regulatory relationship for a set of genes, each variable represents a gene and each directed edge denotes either activating or repressing regulation between two genes.
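
Stated as a formula, the directed acyclic graph corresponds to a factorization of the joint distribution in which each variable depends only on its parents in the graph. The notation $X_i$ for the random variables and $\mathrm{Pa}(X_i)$ for the parents of $X_i$ is introduced here for illustration.

```latex
\[
P(X_1, \dots, X_n) \;=\; \prod_{i=1}^{n} P\bigl(X_i \mid \mathrm{Pa}(X_i)\bigr)
\]
```

In the gene regulation example, $\mathrm{Pa}(X_i)$ would be the set of genes with a directed edge into gene $i$.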

Steiner tree problem

Formulated on a network to find a minimum-length subnetwork that interconnects a set of seed nodes. Any two seed nodes may be connected by an edge or a path through other nodes.

Random walk

A mathematical formulation of a number of successive random steps on a graph. It has been widely used to explain stochastic observations, such as diffusion in biological networks.
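
A minimal sketch of one widely used variant, a random walk with restart, may make the definition more concrete. It is an illustration under stated assumptions rather than the method of any particular tool: the function name, the `restart` probability of 0.3 and the four-node toy network are all hypothetical.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, iterations=100):
    """Approximate the long-run visiting probabilities of a walker that,
    at each step, jumps back to a seed node with probability `restart`
    and otherwise moves to a uniformly chosen neighbour."""
    nodes = sorted(adjacency)
    # The walker starts (and restarts) uniformly over the seed nodes.
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(iterations):
        nxt = {n: restart * start[n] for n in nodes}
        for n in nodes:
            neighbours = adjacency[n]
            if not neighbours:
                # An isolated node keeps its own probability mass.
                nxt[n] += (1.0 - restart) * prob[n]
                continue
            share = (1.0 - restart) * prob[n] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        prob = nxt
    return prob


# Hypothetical toy network: an undirected path A - B - C - D.
toy_network = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
scores = random_walk_with_restart(toy_network, seeds={"A"})
```

Nodes closer to the seed receive higher scores, which is the intuition behind using such walks to rank candidate genes by their proximity to known ones in an interaction network.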

Eigenvalue problem

Given a square matrix, the aim is to find a non-zero vector (that is, an eigenvector) such that multiplying the matrix by the vector yields the same vector scaled by a scalar factor (that is, the eigenvalue).
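
One simple way to solve such a problem numerically is power iteration, sketched below. This is an illustrative choice, not a method named in the glossary, and it assumes that the matrix has a single eigenvalue of strictly largest magnitude and that the starting vector is not orthogonal to the corresponding eigenvector. The function names and the 2 × 2 example matrix are hypothetical.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, iterations=200):
    """Return an estimate of the dominant eigenvalue and a unit-length
    eigenvector by repeatedly multiplying and renormalizing."""
    vector = [1.0] * len(matrix)
    for _ in range(iterations):
        product = mat_vec(matrix, vector)
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    # Rayleigh quotient: for a unit vector v this is v . (M v).
    eigenvalue = sum(p * v for p, v in zip(mat_vec(matrix, vector), vector))
    return eigenvalue, vector


# Hypothetical symmetric example matrix.
example = [[2.0, 1.0], [1.0, 3.0]]
eigenvalue, eigenvector = power_iteration(example)
```

The returned pair can be checked directly against the definition: multiplying `example` by `eigenvector` should give, up to numerical error, `eigenvalue` times `eigenvector`.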

Set cover

Given a set of elements and a collection of subsets of those elements, the goal is to find the minimum number of subsets whose union covers all the elements.
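
Finding a true minimum cover is computationally hard in general, so a greedy heuristic is often used in practice. The sketch below shows that heuristic, not an exact solver: it repeatedly picks the subset that covers the most still-uncovered elements. The function name and the toy input are hypothetical.

```python
def greedy_set_cover(universe, subsets):
    """Greedy heuristic for set cover.

    `subsets` maps a name to a set of elements. Returns the names of the
    chosen subsets in the order they were picked. The result covers every
    element but is not guaranteed to be of minimum size.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Pick the subset covering the most elements that are still uncovered.
        best = max(subsets, key=lambda name: len(subsets[name] & uncovered))
        if not subsets[best] & uncovered:
            raise ValueError("the subsets do not cover every element")
        chosen.append(best)
        uncovered -= subsets[best]
    return chosen


# Hypothetical toy instance.
elements = {1, 2, 3, 4, 5}
candidate_subsets = {"S1": {1, 2, 3}, "S2": {2, 4}, "S3": {3, 4}, "S4": {4, 5}}
cover = greedy_set_cover(elements, candidate_subsets)
```

On this toy instance the heuristic first selects `S1`, which covers three elements, and then `S4`, which covers the remaining two.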

About this article

Cite this article

Berger, B., Peng, J. & Singh, M. Computational solutions for omics data. Nat Rev Genet 14, 333–346 (2013). https://doi.org/10.1038/nrg3433
