1 Introduction
The rapid increase in the quantity, diversity and accessibility of digitized patient data has presented unprecedented challenges and opportunities for drug development, regulatory review, and healthcare utilization and decision making (Mayer-Schönberger and Cukier 2014; Roski et al. 2014). In contrast to the existing paradigm of drug development, which relies on systematically collected numeric data, the new reality involves information that comes in diverse forms and shapes. In this context, Big Data means not only electronic health records and claims data but also data captured through every conceivable medium, including social media, Internet searches, wearable devices, video streams, and personal genomic services; it may also include data collected from randomized controlled clinical trials (RCTs), particularly high-dimensional data such as genomic, laboratory, or imaging data.
Arguably, one of the most promising aspects of Big Data in the healthcare arena is its budding role in promoting and advancing research in personalized and precision medicine (Panahiazar et al. 2014; Teli 2014). At the operational level, Big Data can also enhance the design and conduct of clinical trials, from refining design parameters to identifying patients likely to benefit from experimental medicines. In rare disease research, the accessibility of additional data may have the added advantage of helping fill the gap created by the widely recognized paucity of information (Clarke et al. 2011). Further, there are discernibly important implications for Comparative Effectiveness Research (CER), where the growing need to establish the relative risks and benefits of alternative medical interventions requires an evidence base beyond what conventional RCTs can provide (Berger and Doban 2014; Gray and Thorpe 2015).
The accompanying developments in methodological procedures and data visualization can also help to improve operational efficiency in the execution of trials, and to tackle complex analytical issues that cannot readily be handled with traditional approaches. The potential of these developments to reduce costs and to accelerate the delivery of drugs to patients who need them is considerable (LaValle et al. 2011).
On the other hand, Big Data poses considerable technical, analytical and ethical challenges. In the face of vast amounts of data, traditional approaches that rely on transactional database management systems may no longer suffice to link, integrate and process the heterogeneous data emanating from disparate sources (Hilbert and López 2011). The unprecedented volume of information also requires new computational software and hardware capabilities (Assuncao et al. 2013). Analytically, most traditional approaches break down in the face of high-dimensional data (National Research Council 2013). Furthermore, uncritical use of modern algorithmic tools is likely to lead to unacceptable results with unpredictable consequences (Lazer et al. 2014).
Over and above the technical and analytical challenges, there are also lingering issues of privacy and confidentiality, and of whether the data are good enough to support health policy decision-making (Fhom et al. 2015). Concerning privacy and confidentiality, much work is needed to formulate guidelines that help drug developers understand the current thinking about the extent and nature of evidence from Big Data that would be deemed admissible in the drug approval process (Federal Trade Commission 2010; European Parliament, Council of the European Union 1995). A rational and pragmatic approach entails a firm understanding of the balance between the need for data in medical research and the protection of patient privacy.
With respect to data quality, there is a vibrant debate in the scientific community regarding whether real-world data are of sufficient quality for evidence-based medicine. If they are not, many believe that the issue of GIGO (garbage in, garbage out) applies. Many others, including ourselves, argue that although much of real-world data is sparse and a lot of it is “dirty”, with proper analytical, computational and data management tools it is still useful and can support health policy decision-making.
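To make the notion of handling “dirty” data concrete, the following is a minimal, purely illustrative sketch of the kind of cleaning step that precedes any analysis of real-world records; the field names, unit conventions, and plausibility bounds are hypothetical, not drawn from any particular data source.

```python
# Toy sketch: cleaning "dirty" real-world records before analysis.
# Field names and plausibility rules are invented for illustration.

def clean_records(records):
    """Drop unusable records and normalize the weight field to kilograms."""
    cleaned = []
    for rec in records:
        raw = rec.get("weight")
        if raw in (None, "", "N/A"):
            continue  # missing value: exclude rather than guess
        try:
            value = float(str(raw).lower().replace("kg", "").strip())
        except ValueError:
            continue  # unparseable entry ("garbage in")
        if not 20 <= value <= 300:
            continue  # implausible weight: likely a data-entry error
        cleaned.append({"id": rec["id"], "weight_kg": value})
    return cleaned

raw = [
    {"id": 1, "weight": "72.5 kg"},
    {"id": 2, "weight": "N/A"},     # missing
    {"id": 3, "weight": "9999"},    # implausible
    {"id": 4, "weight": 80},
]
print(clean_records(raw))
# → [{'id': 1, 'weight_kg': 72.5}, {'id': 4, 'weight_kg': 80.0}]
```

The point of such explicit, rule-based filtering is transparency: every exclusion is auditable, which matters when cleaned data are to inform health policy decisions.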
In this paper, we provide a high-level overview of the challenges and opportunities of Big Data vis-à-vis drug development, with emphasis on the potential for transforming the current paradigm of clinical research and regulatory review, advancing personalized medicine, and protecting the privacy of study participants. The paper is organized as follows. In Sect. 2 we highlight the place of Big Data in evidence-based medicine, including research in rare diseases and personalized medicine. In Sect. 3 we review the implications of new analytical tools for addressing lingering issues in medical research. In the rest of the paper we discuss some of the challenges in incorporating Big Data in clinical development and conclude with recommendations for further work.
4 Challenges with incorporation of RWD in drug approval
4.1 Technical barriers
The growth in volume and complexity of data has required new technological solutions to facilitate the accessibility and linkage of information from different sources (Hilbert and López 2011). The volume and variety of data necessitate developing highly distributed architectures, introducing increased memory and processing power, and leveraging open-source licensing options (Assuncao et al. 2013). Platforms such as Apache Hadoop, unlike traditional relational database systems, are needed to manage unstructured data, as well as data of diverse formats. Cloud solutions with High Performance Computing (HPC) are increasingly relied upon for tasks that traditional computing facilities cannot handle.
Despite the promise of wearable devices to provide real-time data on the health status of individuals, there are still outstanding issues of harmonizing the information gathered from diverse device types. In addition, there is presently no coherent effort to validate the various devices in popular use, or to create a framework to dependably store the data for aggregation purposes.
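As a small illustration of the harmonization problem, the sketch below maps hypothetical payloads from two invented device formats onto a common schema; real device APIs and field names would differ, and a validated framework would also need provenance and quality metadata.

```python
# Hypothetical sketch: mapping heterogeneous wearable-device payloads
# onto a common schema before aggregation. Both device formats are
# invented for illustration and do not correspond to real products.

ADAPTERS = {
    # each adapter extracts (steps, heart_rate) from one device's payload
    "device_a": lambda p: (p["stepCount"], p["hr"]),
    "device_b": lambda p: (p["activity"]["steps"], p["vitals"]["bpm"]),
}

def harmonize(source, payload):
    """Convert a device-specific payload to the common record format."""
    steps, heart_rate = ADAPTERS[source](payload)
    return {"source": source, "steps": int(steps), "heart_rate": float(heart_rate)}

readings = [
    harmonize("device_a", {"stepCount": 5231, "hr": 71}),
    harmonize("device_b", {"activity": {"steps": 8040}, "vitals": {"bpm": 66.0}}),
]
print(readings)
```

Once records share a schema, they can be pooled and stored uniformly, which is the prerequisite for the dependable aggregation mentioned above.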
4.2 Analytical issues
Despite the considerable advances made in Big Data analytics, there are still pertinent methodological issues that limit the potential use of so-called machine learning tools in evidence-based medicine. In most cases, the operating characteristics of the procedures are not fully explored, and typical applications tend to focus on hypothesis generation rather than confirmation. Indeed, the introduction of such tools as false discovery rate (FDR) control notwithstanding, the issue of multiplicity remains pervasive.
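To illustrate the kind of multiplicity adjustment referred to above, the following is a minimal sketch of the Benjamini–Hochberg step-up procedure for controlling the FDR across m tests; the p-values are made up for the example.

```python
# A minimal sketch of the Benjamini–Hochberg (BH) step-up procedure
# for controlling the false discovery rate across m hypothesis tests.

def benjamini_hochberg(p_values, q=0.05):
    """Return indices of hypotheses rejected at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k_max = rank  # largest rank satisfying the BH criterion
    # reject all hypotheses up to and including rank k_max
    return sorted(order[:k_max])

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(benjamini_hochberg(pvals, q=0.05))
# → [0, 1]  (only the two smallest p-values survive the adjustment)
```

Note that even with such an adjustment in hand, the broader concern remains: when thousands of hypotheses are screened algorithmically, findings should be treated as exploratory until confirmed in a designed study.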
For historical reasons, the development of most of the widely used techniques has evolved in silos, with little or no collaboration among key stakeholders. As was acknowledged in a recent report (National Research Council 2013), there now appears to be a realization that “… massive data analysis is not the province of any one field, but is rather a thoroughly interdisciplinary enterprise. Solutions to massive data problems will require an intimate blending of ideas from computer science and statistics, with essential contributions also needed from applied and pure mathematics, from optimization theory, and from various engineering areas, notably signal processing and information theory.”
4.3 Ethical concerns
An equally important challenge is the ethical issue of ownership of the data. At present there is no clear regulatory or legal framework or guidelines for the use of Big Data in advancing medical research (see, e.g., Gray and Thorpe 2015; Williams and Javitt 2006). Addressing this requires the collaborative efforts of various stakeholders, including legal scholars, sociologists, and other pertinent professionals. Some of the steps that need to be taken may require enhancing existing measures to regulate the collection and use of health information, with particular emphasis on the challenges posed by the explosion of digitized data outside of the traditionally recognized healthcare sector. A critical component of a viable policy should also be recognition of the role to be played by patients in decision making regarding the use of personal information to advance medical science. Most importantly, a responsible public policy should foster innovation while protecting privacy and the confidentiality of personal data.
4.4 Regulatory framework
From a drug licensing perspective, there is no definitive standard on the acceptability of data from non-RCT sources to support approval of a new medicine for use in humans. Hitherto, use of observational data has mainly been limited to the assessment of safety signals and, less frequently, to post hoc exploration of drug utilization and other cost-benefit evaluations. In certain situations, including rare disease research, it may be essential to rely on the data available at the time of drug approval (see, e.g., Stuart et al. 2001, for a research direction pertaining to the generalizability of results from RCTs). While the recent passage of the “21st Century Cures Act” in the US Congress may eventually pave the way for the use of data from other sources to support approval of new drugs (H.R.6 - 21st Century Cures Act 2015), more effort should be exerted to formulate a clear policy on the value and locus of such data in the evidence generation continuum for drug development and licensing. As mentioned earlier, there is an FDA-sponsored initiative underway examining this issue.
5 Conclusion
By all accounts, the digital data era is poised to impact and revolutionize the development and targeting of new medicines. As real-world data become increasingly ubiquitous, they will routinely be used in healthcare decision-making and in providing actionable insights. However, to leverage these data optimally, it is critical to understand the underlying limitations and associated challenges, and to put mitigating measures in place.
A critical success factor for effective use of digitized data in drug development is a robust infrastructure that accommodates the volume, diversity and speed of the information generated by disparate sources and media. This should, of course, be accompanied by complementary methodological developments that seamlessly combine the elegance of traditional statistical theory with the computational efficiencies honed in computer science and related domains. Such a task would indubitably require genuine collaboration among pertinent professionals, including statisticians, computer scientists and software engineers. In addition, there should be a concerted effort to recognize the underlying issues with disparate data generated by different owners, who may not have consistent agendas, and to put in place an effective and transparent framework that would accelerate the use of data to advance medical research.
In the new era of digitized data, the need to protect patient privacy and confidentiality is more imperative than ever before. New ethical standards are required to ensure that information from individual subjects is properly used to advance medical science and to develop cures for hard-to-treat diseases. A balanced approach to protecting privacy, while promoting science, entails the concerted efforts of all relevant players, including ethicists, medical professionals, legal experts, and other stakeholders.
While there are promising signals in the regulatory arena, much work is still needed to give drug developers the requisite guidance for the use of Big Data in supporting New Drug Applications (NDAs). Current guidelines are limited either to post-marketing safety surveillance or to drugs intended for rare diseases. The recent announcement by the European Medicines Agency (EMA) of the so-called “adaptive licensing pilot project” has the implicit intent of encouraging sponsors to use data from real-world experience to support approval for gradual use by broader patient populations (EMA 2014). In the United States, the implementation of the 21st Century Cures Act may promote the development of concrete guidelines on the use of data from patient experience to support NDAs (H.R.6 - 21st Century Cures Act 2015).
Acknowledgments
This manuscript represents an expansion of keynote remarks by Marc L. Berger at the 11th International Conference on Health Policy Statistics (ICHPS 2015), Providence, RI, USA, October 7–9, 2015. The authors would like to thank the anonymous reviewers for their many helpful suggestions.