Introduction
Forecasts of the transmission and burden of COVID-19 provide public health officials with advance warning that allows them to make informed decisions about how to modify their response to the pandemic [1–9]. The COVID-19 pandemic has caused economic burdens to the US, overwhelmed hospitals with ill patients, and further highlighted social inequity and inequalities in access to healthcare [10–15].
In response, several organized modeling efforts were started to give public health officials information that is as up to date as possible about the trajectory of COVID-19 in the US and in Europe [7, 16–18].
The US COVID-19 Forecast Hub is a unified effort to house probabilistic forecasts of incident cases, deaths, and hospitalizations due to COVID-19 in a single, centralized repository [16, 19]. The goal of this repository is to collect, combine, and evaluate forecasts of the trajectory of COVID-19 and to communicate these forecasts to the public and to public health officials at the state and federal level [20]. This repository is not meant to include all possible forecasting targets related to COVID-19; models not included in the COVID-19 Forecast Hub have forecasted vaccine safety, efficacy, and timing, conditional trajectories of COVID-19 given public health action, time-varying \(R_{0}\) values, and hospital bed requirements, among other targets [21–27]. The strength of the COVID-19 Forecast Hub is its ability to store, evaluate, and communicate forecasting efforts systematically and to focus modeling efforts on objective, reportable data.
In addition to the US COVID-19 Forecast Hub, there are COVID-19 hubs that collect computational forecasts for Europe and specifically for Germany and Poland [16–18]. The majority of models submitted to these hubs are computational: statistical or dynamical models trained on structured data.
Statistical models build a forecast by leveraging correlations between the current trajectory of COVID-19 and a set of covariates [28–37]. Traditional data sources used to train models include historical counts of incident cases, deaths, and hospital admissions. A subset of models also train on novel sources of data such as self-reported COVID-19 symptom rates and the rate of visits to a doctor, data related to mobility or contact among individuals, and social media data [38–41].
Dynamical models first pose a deterministic relationship for how an outbreak is expected to evolve and then typically assume that the observed data follow a random process to account for uncertainty between the (conjectured to be true) deterministic process and what is reported [42–44]. The most common dynamical models of the trajectory of COVID-19 extend compartmental models, which assume individuals occupy one of a finite set of states throughout the pandemic, to incorporate time-varying reproduction numbers, multiple different data sources, and more complicated spatial structure [45–48]. Dynamical models often excel at long-term forecasts and at generating a predictive density over an epidemiological variable of interest in response to public health action or potential scenarios [47, 49–53].
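As a minimal sketch of the dynamical approach described above, consider a deterministic SIR compartmental process whose weekly incidence would then be linked to reported data through an observation model. The parameter values below are illustrative, not taken from any model in the hubs.

```python
# Minimal sketch of a dynamical (compartmental) model: a deterministic SIR
# process whose weekly incident infections could later be wrapped in an
# observation model (e.g. negative binomial noise) to match reported data.
# All parameter values are illustrative.

def sir_incidence(beta, gamma, s0, i0, n, weeks, steps_per_week=7):
    """Euler-integrate an SIR model; return weekly incident infections."""
    s, i = s0, i0
    weekly = []
    for _ in range(weeks):
        new_cases = 0.0
        for _ in range(steps_per_week):
            inf = beta * s * i / n      # new infections this step
            rec = gamma * i             # recoveries this step
            s, i = s - inf, i + inf - rec
            new_cases += inf
        weekly.append(new_cases)
    return weekly

# Deterministic trajectory with basic reproduction number beta/gamma = 2,
# so incidence grows while the susceptible pool remains large.
traj = sir_incidence(beta=0.4, gamma=0.2, s0=9_990, i0=10, n=10_000, weeks=4)
```

The deterministic curve is the "conjectured to be true" process; the stochastic observation layer accounts for the gap between it and what is reported.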
Human judgment forecasting relies on the beliefs and activities of a crowd to generate (point or probabilistic) predictions over the possibilities of some future event. Below we present examples of three types of human judgment forecasting: prediction markets, incorporating passive human judgment data into a model, and collecting direct human judgment predictions.
Prediction markets have been developed to predict infectious diseases such as the 2009 swine flu, seasonal influenza, enterovirus, and dengue fever [54–56]. A prediction market provides participants an initial amount of “money” to spend on future events and allows participants to place higher bids on events they think are more likely to occur. After bidding is complete, a model maps the “market price” for each event to a probability, which is interpreted as the crowd’s belief that the event will occur [57]. Prediction markets rely on a large and diverse participant pool and on the model that connects market price to predictive probability to make accurate predictions [58, 59].
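One simple model of the price-to-probability step described above treats the price of a binary contract (paying one unit if the event occurs) as a proxy for its probability and renormalizes across mutually exclusive outcomes. This is a hedged sketch of one common convention, not the specific model of any cited market; the events and prices are illustrative.

```python
# Hypothetical sketch of mapping prediction-market prices to crowd
# probabilities: prices of contracts on mutually exclusive outcomes are
# renormalized to sum to one. Events and prices below are illustrative.

def prices_to_probabilities(prices):
    """Normalize contract prices over mutually exclusive outcomes."""
    total = sum(prices.values())
    return {event: p / total for event, p in prices.items()}

probs = prices_to_probabilities(
    {"peak in Jan": 0.55, "peak in Feb": 0.30, "peak later": 0.20}
)
# The highest-priced contract carries the highest crowd belief.
```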
Passive measures of human activity and behavior, such as posts on social media outlets like Twitter and Facebook and internet search histories, have been used as inputs to a model and have shown improved accuracy compared to a model that uses only epidemiological data for infectious agents like influenza, dengue fever, Zika, and COVID-19 [60–65]. Most models (i) extract features from these social media outlets, (ii) transform the extracted social media data and include objective epidemiological data, and (iii) train a predictive model on this combination of objective and subjective data. Models using social media data are usually statistical or machine learning models, exploiting correlations between these data sources and the target of interest.
Direct predictions, either point predictions or probability densities, of the trajectory of an infectious agent have been elicited from individuals and aggregated for diseases such as influenza and COVID-19 [21, 66–68]. Point forecasts have been elicited from experts through platforms like Epicast [67]. Epicast asks participants to predict the entire trajectory of influenza-like illness (ILI), a marker for the severity of seasonal influenza, by viewing the current ILI time series and then drawing a proposed trajectory from the present week to the end of the influenza season. The aggregate model assigns a probability to an ILI value belonging to the bounded interval \([x,x+\delta ]\) as the proportion of individual trajectories that fall within those bounds. The Epicast model was routinely one of the top-performing models among several computational models submitted to the CDC-sponsored FluSight challenge [67].
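The Epicast-style aggregation above reduces to counting trajectories. A minimal sketch, with illustrative participant values and bin width:

```python
# Sketch of the Epicast-style aggregation described above: the probability
# assigned to the interval [x, x + delta) is the proportion of
# participant-drawn trajectories whose value at a given week falls in that
# interval. Participant values and bin width are illustrative.

def bin_probability(values_at_week, x, delta):
    """Fraction of participant trajectories landing in [x, x + delta)."""
    in_bin = sum(1 for v in values_at_week if x <= v < x + delta)
    return in_bin / len(values_at_week)

# Five participants' predicted ILI values for one future week.
values = [2.1, 2.4, 2.5, 3.0, 3.6]
p = bin_probability(values, x=2.0, delta=0.5)  # 2.1 and 2.4 fall in the bin
```

Repeating this over all bins yields a full predictive distribution for the week.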
Three projects to date have collected direct, probabilistic predictions from humans about the transmission and burden of the COVID-19 pandemic [66, 68, 69]. As early as February 2020, human judgment platforms have made predictions of the trajectory of COVID-19 by enrolling experts in the modeling of infectious disease and asking them questions related to reported and true transmission, hospitalizations, and deaths due to SARS-CoV-2 [66]. Experts were also asked to make predictions of transmission conditional on future public health actions. An equally weighted average of expert predictions was used to combine individual predictions into consensus predictions, and reports from this work were generated from February 2020 to May 2020. This work found that, although there was considerable uncertainty assigned to confirmed cases and deaths, a consensus of expert predictions was robust to poor individual predictions, able to make accurate predictions of confirmed cases one week into the future, and gave an early warning signal of the severity of SARS-CoV-2. The second project compared predictions of rates of infection and numbers of deaths between those who were considered experts and laypeople in the United Kingdom [69]. Participants were asked to assign a 12.5th and 87.5th percentile to four questions related to COVID-19: one question with ground truth and three with estimated values for the truth. Expert predictions were more accurate and better calibrated than non-expert predictions; however, expert predictions still underestimated the impact of COVID-19. A third project solicited from experts in statistics, forecasting, and epidemiology direct predictions of one- through four-week-ahead incident and cumulative cases and deaths for Germany and Poland (at the national level) and aggregated these predictions into a “crowd forecast” [68]. The crowd produced more accurate and better calibrated predictive forecasts of cases in both countries, as measured by the weighted interval score, compared to computational models; however, computational models made more accurate predictions of deaths.
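The weighted interval score (WIS) used in that comparison can be computed directly from a quantile forecast. The sketch below follows the standard formulation (a weighted sum of the absolute error of the median and the interval scores of central prediction intervals); the forecast values are illustrative.

```python
# Illustrative implementation of the weighted interval score (WIS) for
# evaluating quantile forecasts: a weighted combination of the absolute
# error of the median and interval scores of central prediction intervals.
# Forecast values below are illustrative.

def interval_score(lower, upper, y, alpha):
    """Interval score of a central (1 - alpha) prediction interval."""
    score = upper - lower                      # width penalty
    if y < lower:
        score += (2 / alpha) * (lower - y)     # penalty for undershooting
    if y > upper:
        score += (2 / alpha) * (y - upper)     # penalty for overshooting
    return score

def weighted_interval_score(median, intervals, y):
    """intervals: list of (alpha, lower, upper) central intervals."""
    k = len(intervals)
    total = 0.5 * abs(y - median)
    for alpha, lower, upper in intervals:
        total += (alpha / 2) * interval_score(lower, upper, y, alpha)
    return total / (k + 0.5)

# A forecast with median 100 and one central 80% interval [80, 120],
# scored against an observed value of 130.
wis = weighted_interval_score(median=100, intervals=[(0.2, 80, 120)], y=130)
```

Lower WIS is better; the score rewards both sharp intervals and good coverage.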
Human judgment predictions have been applied to numerous fields beyond infectious disease, and interested readers can find comprehensive reviews on the status and applications of human judgment forecasting [21, 70, 71]. Select foundational works on aggregating human judgment may be found in the following citations [71–75].
We propose an ensemble algorithm designed to generate forecasts of the trajectory of an infectious agent by combining direct, probabilistic predictions from computational models and human judgment models. We call this ensemble a chimeric ensemble. The literature contains many recipes for combining computational models and models of human judgment, and we include here only a small number of past works on this topic that we feel will provide the reader an introduction to the discipline [76–85].
In this first hypothesis-generating work we: (i) explore the advantages and challenges of combining computational and human judgment models, (ii) compare the performance of a chimeric ensemble to an ensemble of computational models only on six forecasts of incident cases and six forecasts of incident deaths due to COVID-19 at the US national level between January 2021 and June 2021, (iii) compare and contrast an algorithm that assigns different weights to computational models and human judgment based on past performance with an equally weighted combination of models, and (iv) show how a chimeric ensemble can leverage human judgment data to improve predictive performance of an outbreak.
Discussion
We presented a first effort to combine direct probabilistic predictions of the spread and burden of an infectious agent generated by both computational models and human judgement.
A chimeric ensemble, a combination of forecasts generated by computational models and human judgment models, is capable of producing predictions that outperform an ensemble of computational models only. Though a chimeric ensemble has the potential to outperform a computational ensemble, this is not always the case. Throughout these six surveys, a chimeric ensemble was also able to leverage at-times poorer-performing human judgment predictions to (i) outperform a computational ensemble and (ii) guard against relying too heavily on human judgment. Chimeric ensemble modeling is still in its early stages, and the reader should consider this work hypothesis generating.
There are several challenges to overcome when adding human judgment predictions.
Human judgment data must first be collected before predictions can be combined to produce a forecast. Data collection requires a team to pose questions to an audience of forecasters. Questions should be written as clearly and concisely as possible, to minimize bias, and written so that the forecaster understands how the truth will be determined (often called the resolution criteria). After questions are drafted, they must be submitted to a prediction platform. A prediction platform should allow forecasters to easily view the question and resolution criteria and allow the forecaster to submit their prediction with minimal effort. An immense amount of time and effort is needed to draft questions and to build and host a prediction platform. Organizing computational modeling efforts also requires substantial effort [16, 95, 96]. However, the time needed to host computational efforts and answer questions throughout the prediction period may be less burdensome than with a human judgment platform.
After data collection, there continue to be challenges with human judgment predictions. In our opinion, the most pressing issue is missing forecasts. Compared to computational models, we found that human forecasters have a much higher rate of missing forecast submissions, and if one wishes to use only models that submitted all forecasts (a complete case approach), it may not be feasible to include human judgment. Instead, an imputation strategy should be used to account for missing human judgment forecasts. Here we proposed two potential strategies to account for missing forecasts: a “defer to the crowd” approach and a “spotty memory” approach. We found that both methods resulted in similar predictive performance for incident cases and deaths for most imputation functions, though the “defer to the crowd” strategy may produce more accurate predictions of cases when using a Bayesian regression function to impute missing values, and the “spotty memory” approach produced the most accurate forecasts when using median imputation. Both methods were able to incorporate more human judgment models in an ensemble than a complete case analysis. That said, the chimeric ensemble using a complete case approach with equal weights (the most natural approach) showed improved performance compared to a computational ensemble and is one of the best pieces of evidence that adding human judgment can improve forecasts of an infectious agent.
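A minimal sketch of median imputation for missing quantile forecasts, one simple way to realize the imputation step discussed above (the exact “defer to the crowd” and “spotty memory” procedures are not reproduced here). Model names and values are illustrative.

```python
# Hypothetical sketch of median imputation of missing quantile forecasts:
# for each quantile level, a model's missing value is filled with the
# median of the values submitted by the other models. This illustrates the
# imputation step in general, not the paper's exact procedures.

from statistics import median

def impute_missing(forecasts, quantile_levels):
    """forecasts: dict model -> dict quantile -> value (None if missing)."""
    filled = {}
    for model, quantiles in forecasts.items():
        filled[model] = dict(quantiles)
        for q in quantile_levels:
            if filled[model].get(q) is None:
                observed = [f[q] for f in forecasts.values()
                            if f.get(q) is not None]
                filled[model][q] = median(observed)
    return filled

forecasts = {
    "model_a": {0.5: 100.0},
    "model_b": {0.5: 120.0},
    "human_1": {0.5: None},   # missing submission
}
complete = impute_missing(forecasts, quantile_levels=[0.5])
```

After imputation, every model contributes a full set of quantiles, so no model must be dropped as in a complete case analysis.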
An additional challenge when incorporating human judgment into an ensemble is the time needed to collect these human judgment forecasts (see Additional file 1: Fig. S5). We found in this work that the majority of forecasts are collected close to when the survey closes. This is likely because forecasters wait to gather as much information about a question as possible before submitting a prediction. Though in this work the time to collect human judgment forecasts did not pose challenges to building an ensemble, it may pose a problem for future human judgment forecasting tasks that must produce forecasts rapidly.
The need to couple ensemble modeling with an imputation strategy is not unique to chimeric forecasts, but we feel the proportion of missing forecasts is unique [97]. Because the imputation strategies often fill in missing forecasts for a specific target with similar quantile values, one could consider the imputation approach we took to be a type of regularization, and in past literature regularization was found to improve computational and human judgment ensembles [98, 99].
Whether to use performance-based or equal weighting for a chimeric ensemble is still unclear. Compared to an equally weighted ensemble, a performance-based chimeric ensemble showed improved performance for some surveys and weakened performance for others using a spotty memory approach (Additional file 1: Fig. S3), and showed improved performance as additional data was collected for a defer to the crowd approach coupled with a chimeric ensemble when predicting cases (Additional file 1: Fig. S4). A challenge in ensemble modeling, in addition to choosing an algorithm to assign different weights to models, is to know in advance whether differential weighting, or the inclusion of human judgment, will improve or weaken predictive performance. Factors that may help determine whether differential weighting is useful or whether human judgment should be included could be the difference in predicted medians between a computational ensemble and a human judgment ensemble, or potentially the difference in uncertainty in predictions. More work should focus on a three-step approach to ensemble modeling: (i) predicting whether human judgment will improve predictive performance, (ii) predicting whether differential weighting would benefit a set of models, and (iii) then choosing either equal weights or differential weights.
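The two weighting schemes can be contrasted with a small sketch. Equal weighting averages all component quantile forecasts; a simple performance-based alternative (one of many possibilities, not the paper's exact algorithm) weights each model inversely to its mean past WIS. All scores and quantile values are illustrative.

```python
# Hypothetical sketch of equal vs. performance-based weighting for a
# chimeric ensemble. Inverse-score weighting is one simple example of a
# performance-based scheme, not the paper's exact algorithm. Component
# quantile values and past WIS scores are illustrative.

def equal_weights(models):
    return {m: 1 / len(models) for m in models}

def inverse_score_weights(past_wis):
    """past_wis: dict model -> mean WIS on past targets (lower is better)."""
    inv = {m: 1 / s for m, s in past_wis.items()}
    total = sum(inv.values())
    return {m: v / total for m, v in inv.items()}

def combine(quantile_preds, weights):
    """Weighted average of each model's value at one quantile level."""
    return sum(weights[m] * v for m, v in quantile_preds.items())

# Two components' predictions at a single quantile level.
preds = {"computational": 100.0, "human": 140.0}
eq = combine(preds, equal_weights(preds))
pb = combine(preds, inverse_score_weights({"computational": 2.0, "human": 4.0}))
```

The performance-based ensemble pulls the combined quantile toward the historically better-scoring component; whether that helps depends on whether past performance predicts future performance for the target at hand.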
A chimeric and human judgment ensemble’s ability to improve predictions of incident cases is consistent with past work studying predictions made exclusively by human judgment [68]. Computational models often make more accurate predictions of deaths because they incorporate reported cases, a signal for upcoming deaths, into their models. We are not sure whether humans considered the time series of incident cases when submitting predictions of deaths. Questions presented to forecasters did not suggest that cases could be a strong signal to consider when building a forecast for deaths. The question of how forecasters use time series information could lead to a controlled experiment testing human judgment’s ability to predict one time series by using a second, correlated time series. Previous literature suggests humans may make strong short-term predictions, perform well when linear correlations exist between two concepts, and focus on information that most differs from their expectations [100–102]. But to the best of our knowledge no work has been done in the area of multi-cue probability theory and judgmental forecasting of time series by providing a second correlated time series.
Because the effort a human can spend on prediction is finite, and because the above results show that human judgment improves predictions of cases the most, we recommend asking crowds to predict cases or similar targets that are strongly correlated with others (such as incident deaths). This may (i) improve predictions of cases and (ii) improve predictions of deaths if these human judgment predictions were used as input to a computational forecasting model.
This work has several limitations. We evaluated only twelve targets in common with the COVID-19 Forecast Hub, and so the results above should be considered exploratory rather than confirmatory. The limited number of targets raises the broader limitation that human judgment cannot be applied to as large a number of targets, locations, and forecast horizons as computational models. The ensemble model we chose to optimize average WIS was deterministic, made no attempt to regularize the weights assigned to models, and is just one type of method to aggregate computational and human judgment models. The number of human judgment participants, while excellent, was still a limitation at times. The empirical nature of this work, versus a controlled laboratory experiment, also makes it difficult to draw strong conclusions about the performance of human judgment, computational models, and their combined performance.
In the future we plan to focus on methodology: (i) building more advanced ensemble algorithms to combine computational and human judgment models, (ii) methods to determine for which targets human judgment is needed and for which it is not, (iii) imputation procedures that take into account the uncertainty of filling in missing forecasts, and (iv) strategies that allow the ensemble builder to preferentially assign higher weights to either humans or computational models, perhaps via a prior distribution; on data collection: (i) proposing strategies to reduce the number of missing human judgment forecasts; and on exploring the limits of human judgment: (i) testing to what degree humans can use one time series to predict another, (ii) how humans construct mental models and generate predictions, and (iii) what additional information human judgment can provide that is supportive of public health efforts.
We envision a chimeric ensemble as a flexible aggregation technique that can manage and combine predictions throughout the evolution of an infectious agent and as a supportive tool for public health. A chimeric ensemble can begin to support primary and secondary preventive measures by relying on fast-acting human judgment to forecast targets while data are collected and computational models are trained. Once computational models begin to forecast, a chimeric ensemble can integrate these forecasts with no downtime. As computational models become accurate for specific targets, human judgment can then be used to predict noisier targets, which can be included in this type of ensemble.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.