Introduction

IBD comprises two disabling immune-mediated conditions: ulcerative colitis and Crohn’s disease1,2. Similar to other chronic, non-infectious diseases, IBD has been classified as a prototypical complex disease3,4,5, in which biological complexity arises from intricate interactions between multiple factors, such as genes, environment, microbiota and diet, among others.

During the past 20 years, major advances have been made in understanding components of IBD physiopathology, which have subsequently led to increased therapeutic options with the development of biologics and small molecule drugs engaging different targets6. Increases in IBD incidence and prevalence are observed worldwide but are particularly pronounced in developing countries, and this trend is expected to continue in the coming years7,8. This growing IBD burden will probably exacerbate current issues such as health-related costs and access to care.

Despite important breakthroughs in the past two decades, the complexity of IBD creates enormous challenges, and traditional scientific methods have been unable to address important research questions, which manifests as unmet clinical management needs9,10. The current paradigm of research in IBD has led to many frustrating results, and innovative methods are required to help disentangle disease complexity, which will ultimately translate into better patient care.

Theoretically, the integration of a wealth of omics data with clinical information and information on factors such as lifestyle, diet and environmental exposures could enable three major unmet clinical needs to be addressed: the identification of biomarkers that enable the early and unambiguous identification of patients with IBD before the full clinical picture has unfolded, thereby allowing very early treatment initiation; the stratification of patients by their predicted response to different drugs; and the stratification of patients by predicted disease course, which might inform the use of more or less aggressive treatment approaches.

In this Review, we explore potential applications of big data in IBD research, such as predictive models of disease course and response to therapy, characterization of disease heterogeneity, drug safety and development, precision medicine and cost-effectiveness of care. We also discuss the strengths and limitations of potential data sources that big data analytics could draw from in the field of IBD.

Big data

The increasing generation and availability of digital data in every aspect of life, coupled with enhanced analytical capability owing to advances in computational science, have produced new insights used to improve outcomes in many disciplines, notably in finance and social media. Technology giants like Google, Amazon, Facebook and Apple have successfully used big data approaches to improve sales, boost efficiency and increase earnings11,12. Political campaigns and government agencies have also used large data sets of information produced by citizens to develop models that guide successful electoral strategies13.

Until a few years ago, the health-care sector had not substantially explored the potential benefits of big data14. Wide-spread implementation of big data analyses in health care is eagerly awaited because it has the potential to greatly improve many areas of care15. Several promising applications of big data in health care exist: better understanding of disease pathogenesis and classification of complex diseases; development of predictive prognostic models; reduction of risks; identification of predictive events to support prevention initiatives; improvement of health-care cost-effectiveness; and personalization of therapeutic regimens16,17,18,19.

In an often-cited example of big data in health care, a paper published in 2009 reported the development of an algorithm using Google search queries to track influenza-like illnesses in the USA20. By monitoring and analysing the health-seeking behaviour of millions of users in the form of queries to online search engines, this Google model promised to predict influenza activity more rapidly than the US Centers for Disease Control and Prevention (CDC) model20. However, the model missed the first wave of the influenza A (H1N1) pandemic outbreak in 2009. Furthermore, it proved to be rather inaccurate: the Google model overestimated the number of medical visits for influenza-like illness by twofold compared with the CDC model21,22. Google no longer publishes its influenza prevalence estimates, and the episode offers a lesson in the challenges that might lie ahead.

However, in a successful example published in 2016, a study explored the risk of Parkinson disease using big data methodologies by combining multiple sources of diverse data, including neuroimaging, genetic, clinical and demographic data, contained in the Parkinson disease Progression Markers Initiative archive23. Model-free big data machine-learning-based classification methods could predict Parkinson disease with accuracy, sensitivity and specificity consistently exceeding 96%23.
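
The three performance measures quoted above are all derived from the same confusion matrix. As a minimal sketch (the counts below are hypothetical and are not taken from the cited study, and the function name is an illustrative choice), they can be computed as follows:

```python
def classification_metrics(tp, fp, tn, fn):
    """Return accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total       # share of all predictions that are correct
    sensitivity = tp / (tp + fn)       # share of true cases that the model flags
    specificity = tn / (tn + fp)       # share of non-cases that the model clears
    return accuracy, sensitivity, specificity


# Hypothetical counts chosen only to illustrate the calculation.
acc, sens, spec = classification_metrics(tp=97, fp=3, tn=97, fn=3)
print(f"accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Reporting all three together matters because a model can reach high accuracy in an imbalanced cohort while still missing most true cases.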

The potential of big data in health care has been acknowledged by the US NIH. In 2013, the Big Data to Knowledge (BD2K) initiative was launched to support the research and development of innovative and transforming approaches and tools to maximize and accelerate the integration of big data and data science into biomedical research24. Owing to the intrinsic characteristics of IBD and the management dilemmas that it imposes, the implementation of big data research strategies not only can complement current research efforts but also could represent the only way to overcome the complexity of the disease.

Defining big data

Although an exact and universally accepted definition of big data does not exist, the concept refers to sets of data with a scale and complexity that enforces the use of dedicated analytical and statistical approaches19,25. In the specific case of biomedicine, big data include large-volume and high-diversity biological, genetic, clinical, environmental and lifestyle information collected from single individuals as well as large cohorts in relation to their disease and/or wellness status at one or several time points26.

Distinctive attributes of big data include the four Vs: volume, variety, velocity and veracity12,18,27. The first and most obvious characteristic of big data is volume, namely, the large amount of data in a data set. Health-related data are created and accumulated continuously, and they are expected to continue to grow dramatically up to an almost inconceivable extent. The volume of health-care data was calculated at 153 exabytes in 2013, and at the projected growth rate of 48% a year, that figure is estimated to reach ~2,300 exabytes by 2020 (refs27,28). This very large amount of data arises from the combination of multiple sources of structured data (for instance, administrative databases) and unstructured data (such as clinical notes), which in fact represent the second characteristic of big data: variety17. The third characteristic is velocity, which reflects the speed at which such information is created and accumulated. Speed is also essential to combine and analyse large and diverse data sets rapidly enough to yield valuable information to make decisions18. The final characteristic of big data, which is crucial in health-care informatics, is veracity27. Veracity means that the big data, its analytics and its outcomes provide a faithful representation of the subject under investigation as well as of the distribution of a complex phenomenon in the population. In other words, such data are expected to be unbiased and therefore intrinsically without errors and credible (although the outcome of their analyses might be affected by several technical factors)16,18. This characteristic is of utmost importance to reliably translate medical big data into clinical decisions. Health-care data can comprise various sources of highly variable quality, especially when considering unstructured data. Hence, veracity frequently represents a goal rather than reality18.
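
The volume projection quoted above is simple compound growth. The sketch below shows the arithmetic; the function name is illustrative, and the number of compounding years is treated as an assumption to be checked against the cited sources.

```python
def projected_volume(base_exabytes, annual_growth, years):
    """Compound a starting data volume forward at a fixed annual growth rate."""
    return base_exabytes * (1 + annual_growth) ** years


# 153 exabytes growing at 48% a year; the year count is an assumed parameter.
for years in (6, 7):
    print(years, "years:", round(projected_volume(153, 0.48, years)), "exabytes")
```

With these inputs, seven compounding steps give roughly 2,380 exabytes, which is in line with the ~2,300 exabytes quoted for 2020.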

Big data analytics can receive multiple inputs or data sources (Fig. 1). Theoretically, the variety of these data sources is not restrained. Currently, the most important data sources for medical big data include but are not limited to administrative databases, clinical trials registries, epidemiological studies, electronic medical records, biometric data, patient-reported health data, medical images, biomarker data, omics data (that is, genomics, proteomics and metabolomics data sets), and data from social media and the internet17,29. The variety of potential data sources is expected to continue to grow, although identifying and linking together the sources that will add value and new insights represents a challenge30.

Fig. 1: Overview of big data in IBD.

Big data analytics in IBD research could be fed from multiple potential data sources (or inputs). Raw data from these inputs (both structured and unstructured data) need to be extracted and transformed or processed to be readily usable and stored. Big data platforms (such as Hadoop, MapReduce, Big Table, and so on) are used to organize, integrate and analyse these large volumes of data. Different analytical methods can be used, ranging from traditional statistical methods (such as regression) to advanced methods (including data mining, machine learning, clustering, text analysis and image analytics). The models developed (outputs) can then be used in different applications that might add value to current disease knowledge. CESAME, Cancer and Increased Risk Associated with Inflammatory Bowel Disease in France; GPRD, General Practice Research Database; PIANO, Pregnancy in IBD Neonatal Outcomes; SNIIRAM, Système National d’Information InterRégimes de l’Assurance Maladie; SWIBreg, Swedish Quality Register for IBD.

Computer sciences have produced remarkable advances not only in hardware capacity but also in the development of software analytical platforms that enable large and diverse data sets to be handled and analysed17. Big data analyses use computational approaches, such as data mining and machine-learning algorithms, to extract information from a data set and to identify patterns generated by sets of features associated with disease risk, prognosis or response to therapy11 (Fig. 1). Importantly, in most cases, these approaches return hypothesis-free predictive models, without a clear explanation of the outcome (for example, in weather forecasts, the accuracy of the prediction is important, not the complete understanding of the underlying causes). This approach contrasts with traditional hypothesis-driven research, in which hypotheses are formulated on the basis of observations, followed by design and execution of experiments and then validation of results, which ultimately leads to acceptance or rejection of the hypothesis31. Description of these platforms and analytical methods is beyond the scope of this Review. Analysis of health-care big data is an opportunity to discover new patterns, associations and trends that ultimately improve patient care and disease outcomes and reduce health-related costs18.

Why do we need big data in IBD?

Disease heterogeneity

IBD has been arbitrarily divided into Crohn’s disease and ulcerative colitis on the basis of descriptive characteristics, with the terms ‘indeterminate colitis’ or ‘IBD unclassified’ used when distinction is not possible32. IBDs are heterogeneous diseases in which a wide range of clinical phenotypes are possible, regarding not only disease location and behaviour but also age of onset, severity of symptoms, association with other immune-mediated conditions, extraintestinal manifestations, complications, response to therapy, need for surgery, and so on33,34,35. Moreover, the effect of the disease on the patient, disease burden and disease course should also be taken into account to correctly classify the disease36. Better classification of IBD into distinct phenotypes will not only lead to better understanding of the disease but might also help identify particular subgroups of patients that would benefit from particular interventions.

Big data approaches to disease heterogeneity might help identify these phenotypes; the hypothesis-free nature of data mining and other methodologies takes into consideration a large number of variables from multiple sources. Some studies that have used big data methods to define distinct groups of patients (so-called phenomapping) are already available, especially in the fields of oncology, cardiology and diabetes37,38.
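
To illustrate what hypothesis-free grouping means in practice, the sketch below runs a plain k-means clustering on a handful of hypothetical patients described by two features already scaled to the 0 to 1 range. It is a toy stand-in for the idea of phenomapping, not the method used in the cited studies; the data, the starting centroids and the function name are all assumptions.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Mean of each cluster; an empty cluster keeps its previous centroid.
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical patients: (scaled age at onset, scaled inflammatory marker).
patients = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
            (0.80, 0.90), (0.85, 0.80), (0.90, 0.95)]
centroids, clusters = kmeans(patients, centroids=[(0.0, 0.0), (1.0, 1.0)])
for centre, members in zip(centroids, clusters):
    print(centre, members)
```

No outcome label is supplied: the groups emerge from the feature values alone, which is why the clinical meaning of each cluster still has to be established afterwards.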

Predictive models

Evidence suggests that early introduction of intensive treatment (that is, combination therapy of biologics and immunosuppressors) in Crohn’s disease leads to better outcomes and might be associated with a disease-modifying effect, reducing complications, need for surgery and hospitalizations39,40,41. Features associated with a high risk of an aggressive disease course include perianal disease, ileocolonic location, young age at diagnosis and need for steroids to treat the first flare. However, many patients possess these factors, and they might not be accurate predictors of a severe disease course42,43. In ulcerative colitis, factors such as extensive disease, need for systemic corticosteroid therapy at disease onset, young age, extraintestinal manifestations and biochemical parameters were also associated with a more aggressive disease course44. Given the potential risks and costs of therapy, defining reliable risk factors and predictive models for severe or complicated disease course in IBD is of paramount importance.

Currently, one of the most common uses in health care for big data methodologies is to develop predictive models that identify high-risk or high-cost patients45, for instance, by including previously unconsidered variables and other difficult-to-handle or complex information such as omics data.

Precision medicine

The IBD therapeutic pipeline has expanded dramatically in the past decade, and several new biologic and small molecule compounds are expected to be available in the next few years6. To rationally use these therapeutic resources, it will be crucial to develop biomarkers that reliably identify which patients would benefit, or be harmed, by a particular drug46. Efforts have been made to predict response to anti-TNF therapy on the basis of clinical information (such as disease duration, phenotype and smoking status) from retrospective studies and post hoc analyses of clinical trials47,48, as well as from the study of TNF gene polymorphisms49,50, but results are inconsistent, and there is a paucity of tools to predict anti-TNF response in clinical practice51.

In light of this lack of success, tailored therapy for a given patient will probably need input not only from clinical and laboratory information but also from complex omics data. Integration of these multiple data sources in big data studies will therefore be of utmost importance for the development of precision medicine in IBD.

Drug safety

The introduction of new therapies always brings safety concerns, as randomized controlled trials are usually underpowered to detect very infrequent but clinically relevant adverse events. Additionally, such adverse events usually take years or even decades to occur (as in the case of malignancy), beyond the follow-up period of most clinical trials. Currently, the field relies on post-marketing studies, such as the Cancer and Increased Risk Associated with Inflammatory Bowel Disease in France (CESAME)52 or IBD Cancer and Serious Infections in Europe (I-CARE)53 studies, but these registries are costly, very time consuming and usually take several years from drug release to develop the full picture of the safety profile of a drug.

By simultaneously evaluating multiple sources of diverse information, big data approaches have the potential to rapidly detect safety signals before currently available tools. Implementation of these techniques applied to drug safety and detection of adverse events is starting to be explored54,55,56. For instance, pharmacovigilance can be improved using text mining, a computational process in which meaningful information is extracted from unstructured textual data sources, to obtain data on adverse drug events from medical notes54.
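
As a rough sketch of how text mining can surface candidate safety signals, the example below counts how many notes mention a drug term and an adverse event term in the same sentence. The vocabularies, the notes and the function name are invented for illustration. A real pharmacovigilance pipeline would also need to handle negation, synonyms, misspellings and temporality, and co-occurrence alone does not establish causality.

```python
import re
from collections import Counter

# Illustrative vocabularies; real systems would use curated terminologies.
DRUGS = {"infliximab", "adalimumab", "vedolizumab"}
EVENTS = {"rash", "arthralgia", "pneumonia", "headache"}


def drug_event_counts(notes):
    """Count notes in which a drug term and an event term share a sentence."""
    counts = Counter()
    for note in notes:
        pairs = set()  # each pair is counted at most once per note
        for sentence in re.split(r"[.!?]\s*", note.lower()):
            tokens = set(re.findall(r"[a-z]+", sentence))
            for drug in DRUGS & tokens:
                for event in EVENTS & tokens:
                    pairs.add((drug, event))
        counts.update(pairs)
    return counts


notes = [
    "Started infliximab in March. Patient reports rash after infliximab infusion.",
    "On vedolizumab, no complaints. Headache last week, unrelated.",
    "Developed rash two days after infliximab dose.",
]
counts = drug_event_counts(notes)
print(counts.most_common())
```

Restricting matches to the same sentence is a deliberately crude proxy for relatedness; it keeps the headache in the second note from being attributed to vedolizumab.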

Epidemiology and public health

IBD has become a global disease in the past few decades. In developed countries, prevalence is increasing, although the incidence is stable8. On the other hand, incidence of IBD in newly industrialized countries has increased steeply, a phenomenon also seen in developing countries with westernization of lifestyle57. With this changing epidemiological scenario, the disparity of care across countries will probably be exacerbated58. Studies using big data methodologies could help design models that predict health-care utilization to better allocate resources59,60. For instance, Sebaa et al. used a Hadoop platform to model equitable health resource allocation in the Béjaïa region in Algeria59.

Additionally, health-care costs are rapidly increasing worldwide and in the case of IBD are mainly driven by biologic medication costs61. In this context, big data research can help improve cost-effectiveness in IBD by correctly identifying patients at risk of an aggressive disease course and those who will benefit from a particular drug at a given time in the disease course.

Drug discovery and development

Although the past decade has seen the IBD pipeline expand markedly, some issues in drug research and development (R&D) still need optimization. The R&D process for new drugs is a very expensive endeavour, ranging from ~US$3 billion to more than $30 billion per approval62. Moreover, some compounds prove to be ineffective or even harmful only at late stages of development, wasting great amounts of time and resources and putting individuals at risk. For instance, the antisense oligonucleotide mongersen showed extremely positive effects in a phase II trial in Crohn’s disease63, but the phase III programme was terminated due to futility64. In another example, secukinumab, a fully human anti-IL-17A monoclonal antibody, was found to be ineffective, and higher rates of adverse events were noted in the treatment group than in the placebo group, despite animal models and genome-wide association studies (GWAS) suggesting a role of IL-17 in Crohn’s disease65. Tofacitinib, a Janus kinase inhibitor, has also shown inconsistent results in patients with Crohn’s disease despite being effective in those with ulcerative colitis46.

Big data analytics have the potential to improve cost-effectiveness and reduce drug discovery and development times16,66. By linking omics data with clinically relevant data from multiple sources, these methods might help prioritize drug targets, mechanisms of action and target populations67. Currently, clinical trials need to recruit thousands of patients to develop a drug, and very frequently, clinical trial results show remarkable variability in responses to a given drug across the studied population. This variability can be explained by omics diversity and phenotypical heterogeneity of the patient population, which big data approaches might help to account for68.

Furthermore, big data could be used for repurposing already approved drugs for other indications16,69,70. In one example, Dudley et al.71 applied a computational approach to discover potential new drug therapies for IBD in silico. They compared gene expression profiles from human cell lines treated with 164 different small molecule compounds with publicly available gene expression measurements and data from a previously published study that evaluated Crohn’s disease and ulcerative colitis in human intestinal tissue obtained by biopsy72. They predicted that the anticonvulsant topiramate would have therapeutic activity in IBD and experimentally validated this finding in vivo in a mouse model72. Nevertheless, in a large retrospective cohort study, topiramate use was not associated with a reduction in steroid use, need for anti-TNF agents, surgery or hospitalizations73, and the drug has not been further investigated in IBD.
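
The logic of signature-based repurposing can be sketched in a few lines: a compound is a candidate if the expression changes it induces run opposite to those seen in diseased tissue. The example below uses a Pearson correlation as a simplified stand-in for the similarity scoring in the cited work, which is not reproduced here. The gene values, compound names and function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def rank_by_reversal(disease_signature, drug_signatures):
    """Sort compounds so that the most anti-correlated signatures come first."""
    scores = {name: pearson(disease_signature, signature)
              for name, signature in drug_signatures.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical log fold-changes for the same five genes.
disease = [2.0, 1.5, -1.0, 0.5, -2.0]
compounds = {
    "compound_A": [-1.8, -1.2, 0.9, -0.4, 1.9],  # roughly opposite to disease
    "compound_B": [1.9, 1.4, -0.8, 0.6, -2.1],   # mimics the disease pattern
    "compound_C": [0.1, -0.2, 0.0, 0.3, -0.1],   # little relationship
}
ranking = rank_by_reversal(disease, compounds)
for name, score in ranking:
    print(f"{name}: {score:+.2f}")
```

A strongly negative score only nominates a compound for testing; as the topiramate example shows, an in silico prediction can fail to translate into clinical benefit.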

Sources of big data in IBD

Administrative databases

Administrative databases are the most straightforward sources of data for big data research in IBD. Many countries have developed large databases for storing data that are routinely collected during clinic, hospital, laboratory or pharmacy visits74. Although most of these databases were initially designed for reimbursement of health-care services, they have been extensively used for epidemiological, effectiveness and safety outcome studies74.

The French SNIIRAM (Système National d’Information InterRégimes de l’Assurance Maladie) linked with the PMSI (Programme de Médicalisation des Systèmes d’Information) is possibly the world’s largest continuous homogeneous claims database75. This database includes individual medical and sociodemographic information from all hospital care and outpatient medicine reimbursements of 98.8% of the population living in France (~66 million people) from birth (or immigration) to death (or emigration)75,76,77,78. The value of this system has been demonstrated in numerous publications, ranging from epidemiological to pharmacoeconomical studies78, including those in IBD79,80.

Another European example of a successful administrative database is the British GPRD (General Practice Research Database), a computerized database of anonymized patient data collected continuously since 1987 (ref.81). This system contains information on ~4.8 million patients in the United Kingdom, equivalent to ~7% of the population, collected from >600 general practices81. The GPRD has proved to be reliable for IBD studies, although it can be difficult to extract relevant information, such as date of incident diagnoses, hospitalizations and surgeries, owing to incomplete records82.

The Swedish NPR (National Patient Register) was established in 1964 and achieved virtually universal coverage in 2001, when data on specialized hospital-based outpatient care were added83. The NPR contains data on diagnoses and procedure codes. The Swedish Quality Register for IBD (SWIBreg), established in 2005, contains clinical data that are either missing or lacking in detail in the NPR and covers ~50% of the country’s IBD population84. Diagnoses of IBD in both the NPR and the SWIBreg have been well validated for use in clinical studies85. Notably, many countries across the world have implemented similar databases that enable epidemiological research86,87,88.

In the USA, the collection of health data is separated between multiple administrative databases according to specific age or income groups (Medicare and Medicaid services, respectively)89, profession (for instance, Veterans Affairs)90 or members of private insurance plans. Often, linkage between different databases or long-term follow-up is not possible. In an effort to homogenize data, a growing number of states have established databases that collect insurance claims information from all health-care payers into all-payer claims databases91,92, and many other states are considering such a law or programme91.

Electronic health records

Adoption of electronic health records (EHRs) varies greatly across countries, although rates have been increasing worldwide, and some countries have moved entirely to EHRs26. Massive amounts of data are generated and accumulated simply as a by-product of medical attention.

In the USA, physicians have been encouraged to use EHRs since the Health Insurance Portability and Accountability Act was passed in 1996 with the intention of detecting insurance fraud93, but implementation of EHRs varies widely. Adoption of EHRs also varies in Europe, with countries such as Estonia and the Netherlands reaching almost complete coverage26.

Typically, EHRs include both structured and unstructured data94. Structured data account for approximately one-fifth of available information and exist in the form of patient demographics, diagnosis codes, laboratory data, vital signs and similar material. Structured data can be easily stored, analysed and manipulated18. However, the vast majority of information in EHRs is unstructured in the form of narrative medical notes95; hence, pre-processing of data and computer-based methods such as natural language processing (NLP) are essential to organize, interpret and recognize patterns from these data94. In the past 5 years, adoption of NLP in EHR-based research for various purposes, for instance, pharmacovigilance and phenotyping, has grown markedly96,97. The performance of NLP has improved greatly and will continue to improve as the number of data sources and their volumes grow96.

By using data from EHRs, Waljee et al.98 developed a machine-learning algorithm to predict remission in patients with IBD treated with thiopurines and investigated whether achieving algorithm-predicted remission resulted in fewer clinical events (defined by steroid use, hospitalization or surgery). The algorithm outperformed circulating levels of 6-thioguanine nucleotide in predicting remission (area under the receiver operating characteristic curve 0.79 versus 0.49), and an algorithm-predicted remission was associated with fewer clinical events per year (1.08 versus 3.95; P < 1 × 10⁻⁵)98. Limitations of this algorithm include the use of retrospective data and a single-centre population in its development, and these results should be validated in prospective trials.
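
The area under the receiver operating characteristic curve can be read as the probability that a randomly chosen patient in remission receives a higher score than a randomly chosen patient not in remission, so a value near 0.5 is no better than chance. The sketch below computes it directly from that definition on toy data; the scores, labels and function name are invented and unrelated to the cited study.

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 1, 0, 0, 0]                     # 1 = remission, 0 = no remission
informative = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]    # mostly ranks remitters higher
uninformative = [0.5] * 6                       # the same score for everyone

print("informative marker:  ", round(auroc(informative, labels), 3))
print("uninformative marker:", round(auroc(uninformative, labels), 3))
```

The pairwise form is quadratic in cohort size and is used here only for clarity; rank-based formulas give the same value more efficiently.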

In a study published in 2018, Cai et al. performed a retrospective analysis using NLP to identify arthralgia in the EHR clinical notes from two tertiary hospitals and to compare the risk of arthralgia between patients with IBD receiving vedolizumab and those receiving anti-TNF agents99. They found no increased risk of arthralgia associated with vedolizumab use (HR 1.20, 95% CI 0.97–1.49)99.
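
The reported interval spans 1, which is why the hazard ratio is read as showing no increased risk. As a rough worked example, the standard error and an approximate two-sided P value can be backed out from the published figures, assuming a Wald-type interval that is symmetric on the log scale. That assumption, the rounding of the published values and the function name are all caveats; the result is an approximation and not a figure from the cited study.

```python
import math


def wald_summary(hr, ci_low, ci_high, z_crit=1.96):
    """Recover log-scale standard error, z statistic and two-sided P value from an HR and its 95% CI."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail of the standard normal
    return se, z, p


se, z, p = wald_summary(1.20, 0.97, 1.49)
excludes_null = not (0.97 <= 1.0 <= 1.49)  # does the interval exclude HR = 1?
print(f"se={se:.3f} z={z:.2f} p={p:.3f} excludes_null={excludes_null}")
```

Under these assumptions the P value falls somewhat above 0.05, consistent with an interval that includes 1.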

Clinical trials and epidemiological studies

Landmark clinical trials have shaped current treatment paradigms in IBD. Moreover, post hoc analyses of these trials have revealed valuable findings, such as the importance of mucosal healing, deep remission and histological remission in disease management. These analyses were mainly reserved for the primary researchers and sponsors; however, there is increasing interest in the need for open-access sharing of data from clinical trials100. In 2016, the International Committee of Medical Journal Editors proposed to require authors of clinical trials to publicly share the de-identified individual patient data underlying the results presented in the article no later than 6 months after publication to increase the study reproducibility and to facilitate secondary analyses by external investigators101. Several factors might hamper the availability of these data, such as intellectual property, fears of different conclusions, confidentiality concerns and lack of resources102. Beyond these difficulties, many pharmaceutical sponsors have already created mechanisms for investigators to access patient-level clinical trial data in multiple diseases (including IBD) through open-access platforms103. Although the policies by which trials are included in these platforms vary between companies, most include all trials within certain date ranges after regulatory review and publication of results103. In an interesting example of how these platforms could enable subsequent analyses, Waljee et al. obtained clinical data from the induction and maintenance phase III trial of vedolizumab in ulcerative colitis (GEMINI 1) via the Clinical Study Data Request open-access platform104. They then applied machine-learning tools to develop predictive models of corticosteroid-free endoscopic remission in response to vedolizumab105. Although open data platforms are an opportunity for research, with data available from >3,000 trials, they are underutilized: only 15.7% of trial data sets had been requested by a limited number of researchers as of 2016 (ref.103).

Epidemiological studies such as the IBSEN study and the CESAME study have also greatly contributed to the understanding of IBD, especially regarding natural history and safety of interventions52,106,107,108. Examples of future cohort studies include the I-CARE (NCT02377258, which will look deeper into the risk of malignancy and infections)53,109 and the PREdiCCt studies (NCT03282903)110. For instance, in the PREdiCCt study, patient-generated data on clinical symptoms, diet and lifestyle gathered through a mobile application110 will be integrated with genomic and microbiota data in a multisource input paradigm to study the effects of these factors on IBD flares and recovery110.

The main strength of the information gathered in clinical trials and cohort studies for big data analytics is its high quality and consistency, whereas the availability of data represents the main limitation. In turn, big data approaches might help the design of both interventional and observational clinical studies, such as by improving trial designs, tailoring patient selection, boosting recruitment and lowering costs111,112.

Mobile applications, e-health and social media

During the past two decades, a remarkable shift has occurred towards the digitalization of daily life. The internet and mobile technologies are present in almost every aspect of life, with social media having a preponderant role113. The ‘read-only’ World Wide Web environment has evolved to Web 2.0, characterized by multidirectional communication in which individuals produce, participate, modify and collaborate with user-generated content114,115. These digital interactions lead to the accumulation of an enormous amount of data. e-Health tools and telemedicine (defined as diagnosis, treatment and monitoring of disease at a distance, especially by means of the internet, mobile phone applications and wearable devices) not only arise as a consequence of this context but might also be an opportunity to facilitate self-management and reduce health-care utilization116,117.

The effect of e-health in IBD has been studied in a few clinical trials with dissimilar results118,119,120,121. Whereas earlier trials showed the value of these strategies only in patients with ulcerative colitis (mainly those with mild to moderate disease)122,123, a large randomized controlled trial conducted in the Netherlands and published in 2017 demonstrated that a telemedicine system through a web-based and smartphone application was efficacious in all subtypes of IBD119. Those in the intervention group had reduced use of health-care services (number of outpatient visits and hospital admissions) and increased treatment adherence compared with patients in the standard care group119. However, another randomized controlled trial published in 2018 showed no differences in disease activity and quality of life between telemedicine and standard of care groups after 1 year; telemedicine was associated with a decrease in hospitalizations but also with an overall increase in health-care utilization124. A comprehensive telemedicine system in IBD should include not only patient-reported outcome data but also objective markers of inflammation125. In this regard, faecal calprotectin levels measured using a home-based test linked to a smartphone application showed good correlation with levels determined by laboratory-based enzyme-linked immunosorbent assay (ELISA) analysis126,127. The implementation of e-health and its use as a source of data for big data analytics will surely face challenges, especially regarding data privacy, security and legal ownership125.

Social media can also serve as a data source that offers particular opportunities to gain new insight on health-seeking behaviour, epidemiological trends and patients’ perspectives of disease and treatments128. For instance, a study published in 2017 used a netnography analysis — a method to understand social interactions in the context of contemporary social networks — to evaluate posts from Twitter and >3,000 social media sites to reveal patients’ experience and choice of biologics in IBD129. They examined 1,598 IBD-related posts and found that the main themes of interaction were negative experiences with biologics, decision-making surrounding biologic use, positive experiences with biologics, information-seeking from peers and costs129.

Medical imaging

Imaging techniques, particularly MRI, CT and ultrasonography, are increasingly used as diagnostic tools and non-invasive objective measures of inflammation in IBD130. As these techniques become widely available and cloud systems are used to digitally store and process imaging findings, the volume of data in the form of medical images will continue to grow exponentially131. Application of big data methodologies in the field of medical imaging has the potential to enhance pattern recognition of lesions and thereby enable more accurate interpretation of results. Big data can also help determine which patients are likely to have a better diagnostic yield with a given imaging technique132,133. Challenges include the difficulty of comparing images obtained using different techniques and of integrating imaging data with data from other sources.

Genomics, proteomics, metabolomics and microbiomics

GWAS have identified multiple loci associated with increased risk of IBD134,135,136. Moreover, high-resolution genetic studies have identified within these loci the specific single nucleotide variants (SNVs) responsible for the increase in IBD risk137, although the underlying mechanisms linking individual SNVs to disease risk are still unclear. Some genetic variants have proved to be associated with distinct disease phenotypes, such as NOD2 gene mutations in fibrostenotic Crohn’s disease138. However, as a general rule, most genetic variants have a rather small effect on overall disease risk, prognosis or response to therapy, implying that genetic variants are by themselves not predictive and that most people carrying a high-risk variant will never develop the disease139,140. Moreover, most of these risk variants are shared with other chronic inflammatory diseases, such as mutations in IL23R in ankylosing spondylitis and psoriasis, and mutations in NOD2 in mycobacterial disease141, which indicates that, although they might contribute to an overall increase in inflammatory disease risk, they do not dictate organ specificity142,143. Overall, these data imply that the phenotypic effect of genetic variants is modulated by a plethora of non-genetic factors, which probably include diet as well as the composition and diversity of the intestinal microbiome144. The role of these additional factors imposes the need to integrate data from GWAS with data from other omics approaches, such as those investigating changes in gene expression and in the accessibility and usage of the genome (for example, changes in DNA methylation) in both intestinal and immune cells. Efforts in this direction are now being carried out worldwide in large-scale consortium projects, such as the Systems Medicine Approach to Chronic Inflammatory Diseases (SYSCID) consortium145.
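
The statement that most carriers of a high-risk variant never develop the disease can be illustrated with simple arithmetic. The following minimal sketch splits a population prevalence into the absolute risks for carriers and non-carriers. The function name is ours, and the prevalence, carrier frequency and relative risk are hypothetical round numbers chosen only for illustration; they are not estimates taken from the cited studies.

```python
def risk_by_carrier_status(prevalence, carrier_freq, relative_risk):
    """Return (risk in carriers, risk in non-carriers).

    Uses the identity
        prevalence = carrier_freq * risk_carrier
                     + (1 - carrier_freq) * risk_noncarrier
    together with risk_carrier = relative_risk * risk_noncarrier.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must lie strictly between 0 and 1")
    if not 0 <= carrier_freq <= 1:
        raise ValueError("carrier_freq must lie between 0 and 1")
    if relative_risk <= 0:
        raise ValueError("relative_risk must be positive")

    risk_noncarrier = prevalence / (carrier_freq * relative_risk + (1 - carrier_freq))
    risk_carrier = relative_risk * risk_noncarrier
    return risk_carrier, risk_noncarrier


# Hypothetical inputs (assumptions for illustration only):
# 0.5% of the population affected, 10% carry the variant,
# and carriers have three times the risk of non-carriers.
carrier_risk, noncarrier_risk = risk_by_carrier_status(
    prevalence=0.005, carrier_freq=0.10, relative_risk=3.0
)

print(f"Risk in carriers:     {carrier_risk:.2%}")
print(f"Risk in non-carriers: {noncarrier_risk:.2%}")
print(f"Carriers who remain unaffected: {1 - carrier_risk:.2%}")
```

Under these assumed inputs, even a threefold relative risk leaves the absolute risk in carriers at just over 1%, which is why a single variant of this kind has little predictive value on its own.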

The rapid generation of enormous amounts of omics data in the past decade has created problems related to storage, analysis, integration and interpretation146,147, which have largely been solved by computational techniques using algorithmic frameworks that are adaptable to large-scale omics data148. It is now clear that bioinformatics and computational sciences are essential to adequately manage and integrate data from these omics approaches and other sources3,149,150.

Conclusions

Most aspects of life have become increasingly digitized over the past few years. Data are generated and accumulated simply as a by-product, and the health-care sector is no exception. Enormous amounts of data are generated through various sources, such as EHRs, administrative databases, clinical trials, registries, social media and omics techniques. Big data studies are an important opportunity to leverage these underutilized data sources and to gain new insights that ultimately lead to a better understanding of IBD and fill the gaps in patient care.

IBD research has seen great advances, although many unmet needs clearly remain (Box 1). Currently, the most potent biologic treatments benefit roughly half of patients at most, and complications, impaired quality of life, hospitalizations and surgeries are still common. Despite the introduction of biosimilar agents, treatment-related costs are still very high, and in the context of increasing incidence in low-income and middle-income countries, improving the cost-effectiveness of IBD care is an important goal.

Implementation of big data methodologies in IBD research is very promising (Table 1), but it must be remembered that these research strategies are at an early stage in health care in general. Even in pioneer disciplines in the field, such as oncology and cardiology, reports are scant, and the added value of big data remains to be seen. Lack of direct evidence and the disappointing results of initial studies (such as the aforementioned Google influenza model) urge caution.

Table 1 Examples of big data studies in IBD

Researchers will face several limitations and challenges with the implementation of big data approaches in IBD (Box 2). First, the quality of data across different sources will inherently be heterogeneous, with some sources (for instance, social media or even unstructured information in EHRs) especially prone to poor quality151. Big data approaches can be performed with poor-quality data inputs, which can detrimentally affect the accuracy and clinical utility of the output18. Identification and selection of correct and adequate-quality sources represent important challenges to achieving a critical characteristic of big data in health care: veracity. Second, the availability of data faces ethical and legal constraints related to patient privacy and consent to share individual data. Although personal information is de-identified when data are analysed, the possibility of recognizing individuals still exists152. Third, predictions and models made by computational methods must still be thoroughly validated experimentally and clinically before general use16, as poorly validated models might cause harm151. Independent agencies must oversee and certify commercial profit-driven initiatives that are intended for use in clinical practice151. Fourth, to potentially improve disease management and outcomes, big data outputs must be integrated into clinical practice, and whether big data models are more effective than traditional risk models remains to be seen17.
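
The residual risk of recognizing individuals in de-identified data can be made concrete by counting how many records share each combination of indirect identifiers, in the spirit of k-anonymity. This technique is our choice of illustration and is not taken from the cited work. In the minimal sketch below, the records, field names and helper functions are all invented for the example.

```python
from collections import Counter


def group_sizes(records, quasi_identifiers):
    """Count the records sharing each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def smallest_group(records, quasi_identifiers):
    """Return k, the size of the smallest group (k = 1 means someone is unique)."""
    return min(group_sizes(records, quasi_identifiers).values())


def unique_records(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination occurs only once."""
    sizes = group_sizes(records, quasi_identifiers)
    return [r for r in records
            if sizes[tuple(r[q] for q in quasi_identifiers)] == 1]


# Invented, already "de-identified" records: no names or identifiers remain.
records = [
    {"age_band": "30-39", "sex": "F", "postcode_prefix": "101", "diagnosis": "CD"},
    {"age_band": "30-39", "sex": "F", "postcode_prefix": "101", "diagnosis": "UC"},
    {"age_band": "40-49", "sex": "M", "postcode_prefix": "102", "diagnosis": "CD"},
    {"age_band": "40-49", "sex": "M", "postcode_prefix": "102", "diagnosis": "UC"},
    {"age_band": "70-79", "sex": "F", "postcode_prefix": "103", "diagnosis": "CD"},
]
quasi_identifiers = ["age_band", "sex", "postcode_prefix"]

k = smallest_group(records, quasi_identifiers)
at_risk = unique_records(records, quasi_identifiers)

print(f"Smallest group size k = {k}")
print(f"Records unique on quasi-identifiers: {len(at_risk)}")
```

In this toy data set one record is unique on age band, sex and postcode prefix, so anyone who knows those three facts about that person could learn the diagnosis even though no direct identifier is stored.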

Big data research has overcome some of these challenges and proved its value in other fields, such as finance and politics. The era of big data in health care is definitely still in its infancy, but hopefully, IBD research will benefit from its many promises in the coming years.