Scientific reason
At the heart of precision health is the notion of individualizing treatment based on the specifics of a single unit. Matching the right intervention to the right individual at the right time, in context, is contingent upon the inherent complexity of a phenomenon. On the simple end are problems like matching blood transfusions to blood types, which is relatively straightforward since the problem is (1) not dynamic (i.e., blood type does not change), (2) there is only one key cause (i.e., heredity), and (3) the mechanism is well understood and easily measurable to support clear classifications (e.g., type A, B, O, AB, +/−). A more complex problem is supporting adaptive dosing, such as anti-retroviral care, where the phenomenon is (1) dynamic (i.e., dosage is contingent upon changing white blood count) and (2) multi-causal, as a wide range of factors, beyond just the person’s disease state, influence white blood count. Nevertheless, often, such problems can be simplified into if/then adaptation rules because, like blood type, the mechanism is well-understood and characterized with appropriately validated measures. For problems in this class (i.e., low to moderate complexity), the big data approach to precision health will be very valuable.
However, there are highly complex health problems whose characteristics are poorly matched to using a big data approach alone. A good example of such problems is obesity prevention and treatment. As illustrated elsewhere [
7], obesity is highly complex since it is dynamic and multi-causal, and the mechanisms – even seemingly universal ones such as energy balance – manifest idiosyncratically. For example, it is well known that eating less facilitates weight loss. However, each person ‘eats less’ or struggles with eating less differently, based on food preferences, cultural practices, food access, time of day, learning history, etc. The level of calorie restriction required also varies, thus suggesting physiological differences. Individualizing prevention and treatment likely require that those idiosyncrasies be accounted for. Modest successes, particularly for achieving robust weight loss maintenance [
8,
9], suggest room for improvement for supporting individuals. As most major health issues today are chronic as opposed to acute [
10], in all likelihood, the level of complexity of the problems we seek to address will increasingly be closer to that of obesity than of blood type.
If the problems we face are more akin to obesity than to blood type, then the big data approach alone will be insufficient since the more dynamic, multi-causal, and idiosyncratically manifesting a problem is, the harder it will be to obtain the appropriate data types of meaningful causal factors at the appropriate temporal density from a large enough number of units. Data analytics that are based, in part, on identifying clusters and patterns across people will experience exponential growth of complexity of the modeling space, and thus require huge samples with long time series. Nevertheless, increasingly large datasets are becoming available. Thus, big data will play an important role, such as modeling variations in comorbidities across units.
Even with the large datasets available, the big data approach requires a great deal of knowledge about a phenomenon to ensure the right data types are included. For example, race is commonly measured, partially because it is relatively easy to measure via self-report and uses ‘standardized’ categories. Prior work is challenging assumptions about the meaning of this variable, particularly an implicit assumption that race is a biological as opposed to a socially constructed concept. ‘Race’ is largely contingent upon the cultural context for which an individual exists within [
11]. It is quite plausible that the categories of race create more noise than signal when used, particularly if they are treated as biological, immutable realities, which could propagate inequities from the research conducted [
12]. This issue will only magnify when data are aggregated across individuals. While we recognize this issue with race, it is quite plausible that similar hidden misclassifications exist, thus creating a high risk of inappropriate conclusions from big data. A central task, then, even when the goal is to use big data approaches, is to advance ways of gathering complementary prior knowledge to understand and analyze a complex phenomenon. This has classically occurred through clinical expertise and qualitative methods and, as justified herein, could be further supported with a small data approach.
Even if this colossally complex issue of obtaining the right data types at sufficient temporal density from a large enough sample based on robust prior knowledge were solved, if the mechanism is known to manifest idiosyncratically (see [
13] for many concrete examples), then big data will become not just insufficient but, potentially, problematic as it may wash out or ignore meaningful individual differences. For example, the behavioral science version of reinforcement learning (i.e., increasing future behaviors via giving rewards, like giving a dog food after sitting) is one of the most well understood drivers of behavior across organisms [
14,
15]. While the mechanism is universal, it manifests idiosyncratically [
14,
15]. Think, for example, of the pickiness of children. One child might find strawberries to be a reward whereas another child might find them to be aversive. Learning histories and individual preferences combine to create tremendous variability in how different people respond [
13] to both specific elements in the environment (e.g., strawberries) as well as classes of those elements (e.g., dessert). These concrete details of mechanism manifestation will be averaged out in aggregated analyses, yet it is precisely at that level of concreteness that treatments have to be individualized [
14‐
16]. Because of its focus on advancing goals of an N-of-1 unit and inclusion of that N-of-1 unit in the process, a small data approach has unique capabilities for issues that manifest idiosyncratically and, thus, are important for advancing precision health.
A small data approach uses different strategies to understand dynamic, multi-causal, and idiosyncratically manifesting phenomena, which can help to make these complexities more manageable. Within a big data paradigm, there is an implicit requirement that all plausibly meaningful variation is included in the dataset at a large enough scale to enable meaningful clusters and relationships in aggregate to be gleaned. Without this, what has been called ‘the black swan effect’ [
17], can occur, whereby a rare phenomenon not in a dataset is not deemed possible and, thus, not part of the modeling efforts. Using a small data approach, there is an incentive for people for whom the data are about to think carefully through insights collected from the data and, thus, to engage in gathering the right data types at sufficient temporal density to enable them to gather actionable insights for improved prediction and control for themselves. Further, a great deal of causal factors can be ruled out based on attributes of the person, context, or time, with the individual unit playing an important role in ruling out these possibilities (e.g., “I never eat those types of food; I’m not ever exposed to those environmental issues”). An individual understands their own lives, contexts, and preferences, which can facilitate specifying the idiosyncratic manifestations that need to be measured. For example, an individual may know – or could quickly learn – the degree to which salty foods versus sugary foods might trigger them to over eat. Finally, as discussed in detail below, a small data approach targets helping individuals first, not transportable knowledge first, which enables insights to be gleaned from data without the higher bar of those insights being generalizable to others.
In summary, from a scientific perspective, a small data approach has unique, complementary strategies for managing complex, dynamic, multi-causal, idiosyncratically manifesting phenomena compared to a big data approach, which could be valuable regardless of their value to big data approaches as well as for improving big data analytics.