Introduction

Forecasting is one of the key purposes of epidemic modelling, and despite being related to the understanding of underlying mechanisms, it is a conceptually distinct task1,2. Explanatory models are often strongly idealised and tailored to specific settings, aiming to shed light on latent biological or social mechanisms. Forecast models, on the other hand, have a strong focus on observable quantities, aiming for quantitatively accurate predictions in a wide range of situations. While understanding of mechanisms can provide guidance to this end, forecast models may also be purely data-driven. Accurate disease forecasts can improve situational awareness of decision makers and facilitate tasks such as resource allocation or planning of vaccine trials3. During the COVID-19 pandemic, there has been a surge in research activity on epidemic forecasting. Contributions vary greatly in terms of purpose, forecast targets, methods, and evaluation criteria. An important distinction is between longer-term scenario or what-if projections and short-term forecasts4. The former attempt to discern the consequences of hypothetical scenarios (e.g., intervention strategies), a task closely linked to causal statements as made by explanatory models. Scenarios typically remain counterfactuals and thus cannot be evaluated directly using subsequently observed data. Short-term forecasts, on the other hand, refer to brief time horizons, at which the predicted quantities are expected to be largely unaffected by yet unknown changes in public health interventions. This makes them particularly suitable to assess the predictive power of computational models, a need repeatedly expressed during the pandemic5.

Rigorous assessment of forecasting methods should follow several key principles. Firstly, forecasts should be made in real time, as retrospective forecasting often leads to overly optimistic conclusions about performance. Real-time forecasting poses many challenges6, including noisy or delayed data, incomplete knowledge of testing and interventions, and time pressure. Even if these are mimicked in retrospective studies, some benefit of hindsight remains. Secondly, in a pandemic situation with low predictability, forecast uncertainty needs to be quantified explicitly7,8. Lastly, forecast studies are most informative if they involve comparisons between multiple independently run methods9. Such collaborative efforts have led to important advances in short-term disease forecasting prior to the pandemic10,11,12,13. Notably, they have provided evidence that ensemble forecasts combining various independent predictions can lead to improved performance, similar to what has been observed in weather prediction14.

The German and Polish COVID-19 Forecast Hub is a collaborative project which, guided by the above principles, aims to collect, evaluate and combine forecasts of weekly COVID-19 cases and deaths in the two countries. It is run in close exchange with the US COVID-19 Forecast Hub15,16 and aims for compatibility with the forecasts assembled there. Close links moreover exist to a similar effort in the United Kingdom17. Other conceptually related works on short-term forecasting or baseline projections include those by consortia from Austria18 and Australia19 as well as the European Centre for Disease Prevention and Control20,21 (ECDC). In a German context, various nowcasting efforts exist22. All forecasts assembled in the German and Polish COVID-19 Forecast Hub are publicly available (https://github.com/KITmetricslab/covid19-forecast-hub-de23) and can be explored interactively in a dashboard (https://kitmetricslab.github.io/forecasthub). The Forecast Hub project moreover aims to foster exchange between research teams from Germany, Poland and beyond. To this end, regular video conferences with presentations on forecast methodologies, discussions and feedback on performance were organised.

In this work, we present results from a prospective evaluation study based on the collected forecasts. The evaluation procedure was prespecified in a study protocol24, which we deposited at the registry of the Open Science Foundation (OSF) on 8 October 2020. The evaluation period extends from 12 October 2020 (first forecasts issued) to 19 December 2020 (last observations made). This corresponds to the onset of the second wave of the pandemic in both countries. It is marked by strong virus circulation and changes in intervention measures and testing strategies, see Fig. 1 for an overview. This makes for a situation in which reliable short-term forecasting is both particularly useful and particularly challenging. Thirteen modelling teams from Germany, Poland, Switzerland, the United Kingdom and the United States contributed forecasts of weekly confirmed cases and deaths. Both targets are addressed on the incidence and cumulative scales and 1 through 4 weeks ahead, with evaluation focused on 1 and 2 weeks ahead. We find considerable heterogeneity between forecasts from different models and an overall tendency to overconfident forecasting, i.e., lower than nominal coverage of prediction intervals. While for deaths, a number of models were able to outperform a simple baseline forecast up to 4 weeks into the future, such improvements were limited to shorter horizons for cases. Combined ensemble predictions show good relative performance, in particular in terms of interval coverage, but do not clearly dominate single-model predictions. Conclusions from 10 weeks of real-time forecasting are necessarily preliminary, but we hope to contribute to an ongoing exchange on best practices in the field. Note that the considered period is the last one unaffected by vaccination and driven exclusively by the “original” wild-type variant of the virus. Early January marked both the start of vaccination campaigns and the likely introduction of the B.1.1.7 (alpha) variant of concern in both countries. Our study will be followed up until at least March 2021 and may be extended beyond.

Fig. 1: Forecast evaluation period.

Weekly incident (a, b) confirmed cases and (c, d) deaths from COVID-19 in Germany and Poland according to data sets from the European Centre for Disease Prevention and Control (ECDC) and the Centre for Systems Science and Engineering at Johns Hopkins University (JHU). The study period covered in this paper is highlighted in grey. Important changes in interventions and testing are marked by letters/numbers and dashed vertical lines. Sources containing details on the listed interventions are provided in Supplementary Note 5.

Results

In the following we provide specific observations made during the evaluation period as well as a formal statistical assessment of performance. Particular attention is given to combined ensemble forecasts. Forecasts refer to data from the European Centre for Disease Prevention and Control25 (ECDC) or Johns Hopkins University Centre for Systems Science and Engineering26 (JHU CSSE); see the Methods section for the exact definition of targets and ensemble methods. Visualisations of 1- and 2-week-ahead forecasts on the incidence scale are displayed in Figs. 2 and 3, respectively, and will be discussed in the following. These figures are restricted to models submitted over (almost) the entire evaluation period and providing complete forecasts with 23 predictive quantiles. Forecasts from the remaining models are illustrated in Supplementary Note 7. Forecasts at prediction horizons of 3 and 4 weeks are shown in Supplementary Note 8. All analyses of forecast performance were conducted using the R language for statistical computing27.

Fig. 2: One-week-ahead forecasts.

One-week-ahead forecasts of incident cases and deaths in Germany (a, b) and Poland (c, d). Displayed are predictive medians, 50% and 95% prediction intervals (PIs). Coverage plots (e–h) show the empirical coverage of 95% (light) and 50% (dark) prediction intervals.

Fig. 3: Two-week-ahead forecasts.

Two-week-ahead forecasts of incident cases and deaths in Germany (a, b) and Poland (c, d). Displayed are predictive medians, 50% and 95% prediction intervals (PIs). Coverage plots (e–h) show the empirical coverage of 95% (light) and 50% (dark) prediction intervals.

Heterogeneity between forecasts

A recurring theme during the evaluation period was pronounced variability between model forecasts. Figure 4 illustrates this aspect for point forecasts of incident cases in Germany. The left panel shows the spread of point forecasts issued on 19 October 2020 and valid 1 to 4 weeks ahead. The models present very different outlooks, ranging from a return to the lower incidence of previous weeks to exponential growth. The graph also illustrates the difficulty of forecasting cases >2 weeks ahead. Several models had correctly picked up the upwards trend, but presumably a combination of the new testing regime and the semi-lockdown (marked as (a) and (b)) led to a flattening of the curve. The right panel shows forecasts from 9 November 2020, immediately following the aforementioned events. Again, the forecasts are quite heterogeneous. The week ending on Saturday 7 November had seen a slower increase in reported cases than anticipated by almost all models (see Fig. 2), but there was general uncertainty about the role of saturating testing capacities and evolving testing strategies. Indeed, on 18 November it was argued in a situation report from Robert Koch Institute (RKI) that comparability of data from calendar week 46 (9–15 November) to previous weeks is limited28. This illustrates that confirmed cases can be a moving target, and that different modelling decisions can lead to very different forecasts.

Fig. 4: Illustration of heterogeneity between incident case forecasts in Germany.

a Point forecasts issued by different models and the median ensemble on 19 October 2020. b Point forecasts issued on 9 November 2020. The dashed vertical line indicates the date at which forecasts were issued. Events marked by letters a–d are explained in Fig. 1.

Forecasts are not only heterogeneous with respect to their central tendency, but also with respect to the implied uncertainty. As can be seen from Figs. 2 and 3, certain models issue very confident forecasts with narrow forecast intervals barely visible in the plot. Others—in particular LANL-GrowthRate and the exponential smoothing time series model KIT-time_series_baseline—show rather large uncertainty. For almost all forecast dates there are pairs of models with no or minimal overlap in 95% prediction intervals, another indicator of limited agreement between forecasts. As can be seen from the right column of Figs. 2 and 3 as well as Tables 1 and 2, most contributed models were overconfident, i.e., their prediction intervals did not reach nominal coverage.

Table 1 Detailed summary of forecast evaluation for Germany (based on ECDC data). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.
Table 2 Detailed summary of forecast evaluation for Poland (based on ECDC data). C0.5 and C0.95 denote coverage rates of the 50% and 95% prediction intervals; AE and WIS stand for the mean absolute error and mean weighted interval score.

Adaptation to changing trends and truth data issues

By no means all forecast models explicitly account for interventions and testing strategies (Table 3). Many forecasters instead prefer to let their models pick up trends from the data once they become apparent. This can lead to delayed adaptation to changes and explains why numerous models—including the ensemble—showed overshoot in the first half of November when cases started to plateau in Germany (visible from Fig. 2 and even more pronounced in Fig. 3). Interestingly, some models adapted more quickly to the flatter curve. This includes the human judgement approach EpiExpert, which, due to its reliance on human input, can take information on interventions into account before they become apparent in epidemiological data, but, interestingly, also Epi1Ger and EpiNow2, which do not account for interventions. In Poland, overshoot could be observed following the peak week in cases (ending on 15 November), with the 1-week-ahead median ensemble only barely covering the next observed value. However, most models adapted quickly and were back on track in the following week.

Table 3 Forecast models contributed by independent external research teams.

A noteworthy difficulty for death forecasts in Germany was under-prediction in consecutive weeks in late November and December. In November, several models predicted that death numbers would level off, likely as a consequence of the plateau in case numbers starting several weeks before. In the last week of our study (ending on 19 December) most models considerably under-estimated the increase in weekly deaths. A difficulty may have been that despite the overall plateau observed until early December, cases continued to increase in the oldest age groups, for which the mortality risk is highest (Supplementary Fig. 1). Models that ignore the age structure of cases—which includes most available models (Table 3)—may then have been led astray.

A major question in epidemic modelling is how closely surveillance data reflect the underlying dynamics. As in Germany, testing criteria were repeatedly adapted in Poland. In early September they were tightened, requiring the simultaneous presence of four symptoms for the administration of a test. This was changed to less restrictive criteria in late October (presence of a single symptom). These changes limit comparability of numbers across time. Very high test positivity rates in Poland (Supplementary Fig. 2) suggest that there was substantial under-ascertainment of cases, which is assumed to have worsened over time. Comparisons between overall excess mortality and reported COVID-19 deaths suggest that there is also relevant under-ascertainment of deaths, again likely changing over time29. These aspects make predictions challenging, and limitations of ground truth data sources are inherited by the forecasts which refer to them. A striking example of this was the belated addition of 22,000 cases from previous weeks to the Polish record on 24 November 2020. The Poland-based teams MOCOS and MIMUW explicitly took this shift into account while other teams did not.

Findings for median, mean and inverse-WIS ensembles

We assessed the performance of forecast ensembles based on various aggregation rules, more specifically a median, a mean and an inverse-WIS (weighted interval score) ensemble; see the Methods section for the respective definitions.

A key advantage of the median ensemble is that it is more robust to single extreme forecasts than the mean ensemble. As an example of the behaviour when one forecast differs considerably from the others, we show forecasts of incident deaths in Poland from 30 November 2020 in Fig. 5. The first panel shows the six member forecasts, the second the resulting median and mean ensembles. The predictive median of the mean ensemble is noticeably higher, as it is more strongly impacted by one model which predicted a resurgence in deaths.

Fig. 5: Examples of median and mean ensembles.

One- and two-week-ahead forecasts of incident deaths in Poland issued on 30 November 2020, and of incident cases in Poland issued on 2 November 2020. Panels (a and c) show the respective member forecasts, panels (b and d) the resulting ensembles. Both predictive medians and 95% (light) and 50% (dark) prediction intervals are shown. The dashed vertical line indicates the date at which the forecasts were issued.

A downside of the median ensemble is that its forecasts are not always well-shaped, in particular when a small to medium number of heterogeneous member forecasts is combined. A pronounced example is shown in the third and fourth panel of Fig. 5. For the 1-week-ahead forecast of incident cases in Poland from 2 November 2020, the predictive 25% quantile and median were almost identical. For the 2-week-ahead median ensemble forecast, the 50% and 75% quantile were almost identical. Both distributions are thus rather oddly shaped, with a quarter of the probability mass concentrated in a short interval. The mean ensemble, on the other hand, produces a more symmetric and thus more realistic representation of the associated uncertainty.

We briefly address the inverse-WIS ensemble, which is a pragmatic approach to giving more weight to forecasts with good recent performance. Figure 6 shows the weights of the various member models for incident deaths in Germany and Poland. Note that some models were not included in the ensemble in certain weeks, either because of delayed or missing submissions or due to concerns about their plausibility. While certain models on average receive larger weights than others, weights change considerably over time. These fluctuations make it challenging to improve ensemble forecasts by taking past performance into account, and indeed Tables 1 and 2 do not indicate any systematic benefits from inverse-WIS weighting. A possible reason is that models get updated continuously by their maintainers, including major revisions of methodology.

Fig. 6: Examples of inverse WIS weights.

Inverse-WIS (weighted interval score) weights for forecasts of incident deaths in (a) Germany and (b) Poland.

Formal forecast evaluation

Forecasts were evaluated using the mean weighted interval score (WIS), the mean absolute error (AE) and interval coverage rates. The WIS is a generalisation of the absolute error to probabilistic forecasts and negatively oriented, meaning that smaller values are better (see the Methods section). Tables 1 and 2 provide a detailed overview of results by country, target and forecast horizon, based on data from the European Centre for Disease Prevention and Control25 (ECDC). We repeated all evaluations using data from the Centre for Systems Science and Engineering at Johns Hopkins University26 (JHU CSSE) as ground truth (Supplementary Note 7), and the overall results seem robust to this choice. We also report on 3- and 4-week-ahead forecasts in Supplementary Note 8, though, for reasons discussed in the Methods section, we consider their usability limited. To put the results of the submitted and ensemble forecasts into perspective, we created forecasts from three baseline methods of varying complexity; see the Methods section.

Figure 7 depicts the mean WIS achieved by the different models on the incidence scale. For models providing only point forecasts, the mean AE is shown, which, as detailed in the Methods section, can be compared to mean WIS values. A simple model always predicting the same number of new cases/deaths as in the past week (KIT-baseline) serves as a reference. For deaths, the ensemble forecasts and several submitted models outperform this baseline up to 3 or even 4 weeks ahead. Deaths are a more strongly lagged indicator, which favours predictability at somewhat longer horizons. Another aspect may be that, at least in Germany, death numbers followed a rather uniform upward trend over the study period, making it relatively easy to beat the baseline model. For cases, which are a more immediate measure, almost none of the compared approaches meaningfully outperformed the naive baseline beyond a horizon of 1 or 2 weeks. Especially in Germany, this result is largely due to the aforementioned overshoot of forecasts in early November. The KIT-baseline forecast always predicts a plateau, which is what was observed in Germany for roughly half of the evaluation period. Good performance of the baseline is thus less surprising. Nonetheless, these results underscore that in periods of evolving intervention measures, meaningful case forecasts are limited to a rather short time window. In this context we also note that the additional baselines KIT-extrapolation_baseline and KIT-time_series_baseline do not systematically outperform the naive baseline and for most targets are neither among the best nor the worst performing approaches.

Fig. 7: Forecast performance 1 through 4 weeks ahead.

Mean weighted interval score (WIS) by target and prediction horizon in Germany (a, b) and Poland (c, d). We display submitted models and the preregistered median ensemble (logarithmic y-axis). For models providing only point forecasts, the mean absolute error (AE) is shown (dashed lines). The lower boundary of the grey area represents the baseline model KIT-baseline. Line segments within the grey area thus indicate that a model fails to outperform the baseline. The numbers underlying this figure can be found in Tables 1 and 2.

In exploratory analyses (Supplementary Fig. 9) we did not find any clear indication that certain modelling strategies (defined via the five categories used in Table 3) performed better than others. Following changes in trends, the human judgement model epiforecasts-EpiExpert showed good average performance, while growth rate approaches had a stronger tendency to overshoot (Supplementary Figs. 5–8). Otherwise, variability of performance within model categories was pronounced and no apparent patterns emerged.

The median, mean and inverse-WIS ensembles showed overall good, but not outstanding relative performance in terms of mean WIS. At a 1-week lead time, the median ensemble outperformed the baseline forecasts quite consistently for all considered targets, showing less variable performance than most member models (Supplementary Figs. 5–8). Differences between the ensemble approaches are minor and do not indicate a clear ordering. We re-ran the ensembles retrospectively using all available forecasts, i.e., including those submitted late or excluded due to implausibilities. As can be seen from Supplementary Tables 5 and 6, this led only to minor changes in performance. Unlike in the US effort30,31, the ensemble forecast is not strictly better than the single-model forecasts. Typically, performance is similar to some of the better-performing contributed forecasts, and sometimes the latter have a slight edge (e.g., FIAS_FZJ-Epi1Ger for cases in Germany and MOCOS-agent1 for deaths in Poland). Interestingly, the expert forecast epiforecasts-EpiExpert is often among the more successful methods, indicating that an informed human assessment sets a high bar for more formalised model-based approaches. In terms of point forecasts, the extrapolation approach SDSC_ISG-TrendModel shows good relative performance, but only covers 1-week-ahead forecasts.

The 50% and 95% prediction intervals of most forecasts did not achieve their respective nominal coverage levels (most apparent for cases 2 weeks ahead). The statistical time series model KIT-time_series_baseline fares favourably here, though at the expense of wide forecast intervals (Fig. 2). While its lack of sharpness leads to mediocre overall performance in terms of the WIS, the model seems to have been a helpful addition to the ensemble by counterbalancing the overconfidence of other models. Indeed, coverage of the 95% intervals of the ensemble is above average, despite not reaching nominal levels.

A last aspect worth mentioning concerns the discrepancies between results for 1-week-ahead incident and cumulative quantities. In principle these two should be identical, as forecasts should only be shifted by an additive constant (the last observed cumulative number). This, however, was not the case for all submitted forecasts, and coherence was not enforced by our submission system. For the ensemble forecasts the discrepancies are largely due to the fact that the included models are not always the same.

Discussion

We presented results from a preregistered forecasting project in Germany and Poland, covering 10 weeks during the second wave of the COVID-19 pandemic. We believe that such an effort is helpful to put the outputs from single models in context, and to give a more complete picture of the associated uncertainties. For modelling teams, short-term forecasts can provide a useful feedback loop via a set of comparable outputs from other models and regular independent evaluation. A substantial strength of our study is that it took place in the framework of a prespecified evaluation protocol. The criteria for evaluation were communicated in advance, and most considered models covered the entire study period.

Similarly to Funk et al.17, we conclude that achieving good predictive accuracy and calibration is challenging in a dynamic epidemic situation. Epidemic forecasting is complicated by numerous challenges absent in, e.g., weather forecasting32. Noisy and delayed data are an obstacle, but the more fundamental difficulty lies in the complex social (and political) dynamics shaping an epidemic33. These are more relevant for major outbreaks of emerging diseases than for seasonal diseases, and limit predictability to rather short time horizons.

Not all included models were designed for the sole purpose of short-term forecasting, and some could be tailored more specifically to this task. Certain models were originally conceived for what-if projections and retrospective assessments of longer-term dynamics and interventions. This focus on a global fit may limit their flexibility to align closely with the most recent data, making them less successful at short forecast horizons compared to simpler extrapolation approaches. We observed pronounced heterogeneity between the different forecasts, with a general tendency to overconfident forecasting. While over the course of 10 weeks some models achieved better average scores than others, relative performance fluctuated considerably.

Various works on multi-model disease forecasting discuss performance differences between modelling approaches, most commonly between mechanistic and statistical approaches. Reich et al.13, McGowan et al.34 (both seasonal influenza) and Johansson et al.12 (dengue) find slightly better performance of statistical than mechanistic models. All these papers find ensemble approaches to perform best. Forecasting of seasonal and emerging diseases, however, differ in important ways, the latter typically being subject to more variation in reporting procedures and interventions. This, along with the limited amount of historical data, may benefit mechanistic models. In our study we did not find any striking patterns, but this may be due to the relatively short study period. We expect that forecast performance is also shaped by numerous other factors, including methods used for model calibration, the thoroughness of manual tuning and input on new intervention measures or population behaviour.

Different models may be particularly suitable for different phases of an epidemic17, which is exemplified by the fact that some models were quicker to adjust to slowing growth of cases in Germany. In particular, we noticed that forecasts based on human assessment performed favourably immediately after changes in trends. These aspects highlight the importance of considering several independently run models rather than focusing attention on a single one, as is sometimes the case in public discussions. Here, collaborative forecasting projects can provide valuable insights and facilitate communication of results. Overall, ensemble methods showed good, but not outstanding relative performance, notably with clearly above-average coverage rates and more stable performance over time. An important question is whether ensemble forecasts could be improved by sensible weighting of members or post-processing steps. Given the limited amount of available forecast history and rapid changes in the epidemic situation, this is a challenging endeavour, and indeed we did not find benefits in the inverse-WIS approach.

An obvious extension to both assess forecasts in more detail and make them more relevant to decision makers is to issue them at a finer geographical resolution. During the evaluation period covered in this work, only three of the contributed forecast models (ITWW-county_repro, USC-SIkJalpha and, for the state of Saxony, LeipzigIMISE-SECIR) also provided forecasts at the sub-national level (German states, Polish voivodeships). Extending this to a larger number of models is a priority for the further course of the project.

In its present form, the platform covers only forecasts of confirmed cases and deaths. These commonly addressed forecasting targets were already covered by a critical mass of teams when the project was started. Given the limited time resources of teams, a choice was made to focus efforts on this narrow set of targets. This choice was also motivated by the strong focus German legislators have put on seven-day incidences, which have been the main criteria for the strengthening or alleviation of control measures. However, there is an ongoing debate on the usefulness of this indicator, with frequent calls to replace it with hospital admissions35. An extension to this target was considered but, in view of emerging parallel efforts and open questions on data availability, not prioritised. Given that in a post-vaccination setting the link between case counts and healthcare burden is expected to change, however, this decision will need to be re-assessed.

Estimation of the total number of infections (including unreported ones) and of effective reproductive numbers are other areas where a multi-model approach can be helpful (see ref. 36 for an example of the latter). While, due to the lack of appropriate truth data, these do not qualify as true prediction tasks, ensemble averages can again give a better picture of the associated uncertainty.

The German and Polish Forecast Hub will continue to compile short-term forecasts and process them into forecast ensembles. With the start of vaccine rollout and the emergence of new variants in early 2021, models face a new layer of complexity. We aim to provide further systematic evaluations for future phases, contributing to a growing body of evidence on the potential and limits of pandemic short-term forecasting.

Methods

We now lay out the formal framework of our evaluation study. Unless stated differently, the described approach is the same as in the study protocol24.

Submission system and rhythm

All submissions were collected in a standardised format in a public repository to which teams could submit (https://github.com/KITmetricslab/covid19-forecast-hub-de23). For teams running their own repositories, the Forecast Hub Team put in place software scripts to re-format forecasts and transfer them into the Hub repository. Participating teams were asked to update their forecasts on a weekly basis using data up to Monday. Submission was possible until Tuesday 3 p.m. Berlin/Warsaw time. Delayed submission of forecasts was possible until Wednesday, with exceptional further extensions possible in case of technical issues. Delays of submissions were documented (Supplementary Note 6).

Forecast targets and format

We focus on short-term forecasting of confirmed cases and deaths from COVID-19 in Germany and Poland 1 and 2 weeks ahead. Here, weeks refer to Morbidity and Mortality Weekly Report (MMWR) weeks, which start on Sunday and end on Saturday, meaning that 1-week-ahead forecasts were actually 5 days ahead, 2-week-ahead forecasts were 12 days ahead, and so on. All targets were defined by the date of reporting to the national authorities. This means that modellers have to take reporting delays into account, but it has the advantage that data points are usually not revised over the following days and weeks. From a public health perspective there may be advantages in using data by symptom onset; however, for Germany, the symptom onset date is only available for a subset of all cases (50–70%), while for Poland no such data were publicly available during our study period. All targets were addressed both on cumulative and weekly incident scales. Forecasts could refer to both data from the European Centre for Disease Prevention and Control25 (ECDC) and the Johns Hopkins University Centre for Systems Science and Engineering26 (JHU CSSE). In this article, we focus on the preregistered period of 12 October 2020 to 19 December 2020 (see Fig. 1). Note that on 14 December 2020, the ECDC data set on COVID-19 cases and deaths in daily resolution was discontinued. For the last weekly data point we therefore used data streams from Robert Koch Institute and the Polish Ministry of Health that we had previously used to obtain regional data and which up to this time had been in agreement with the ECDC data.

Most forecasters also produced and submitted 3- and 4-week-ahead forecasts (which were specified as targets in the study protocol). These horizons, also used in the US COVID-19 Forecast Hub15, were originally defined for deaths. Owing to their lagged nature, these were considered predictable independently of future policy or behavioural changes up to 4 weeks ahead; see ref. 37 for a similar argument. During the summer months, when incidence was low and intervention measures largely constant, the same horizons were introduced for cases. As the epidemic situation and intervention measures became more dynamic in autumn, it became clear that case forecasts further than 2 weeks (12 days) ahead were too dependent on yet unknown interventions and the consequent changes in transmission rates. It was therefore decided to restrict the default view in the online dashboard to 1- and 2-week-ahead forecasts only. At the same time we continued to collect 3- and 4-week-ahead outputs. Most models (with the exception of epiforecasts-EpiExpert, COVIDAnalytics-Delphi and in some exceptional cases MOCOS-agent1) do not anticipate policy changes, so that their outputs can be seen as “baseline projections”, i.e., projections for a scenario with constant interventions. In accordance with the study protocol we also report on 3- and 4-week-ahead predictions, but these results have been deferred to Supplementary Note 8.

Teams were asked to report a total of 23 predictive quantiles (1%, 2.5%, 5%, 10%, …, 90%, 95%, 97.5%, 99%) in addition to their point forecasts. This motivates considering forecasts of both cumulative and incident quantities, as predictive quantiles generally cannot be translated from one scale to the other. Not all teams provided such probabilistic forecasts, though, and we also accepted pure point forecasts.
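For concreteness, the required levels can be enumerated directly. The short R snippet below (R being the language used for all of our evaluation analyses) simply lists them together with the implied central interval levels; the variable names are purely illustrative.

```r
# The 23 predictive quantile levels requested from each team
quantile_levels <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
length(quantile_levels)  # 23

# Besides the median (0.5), these levels define 11 central prediction intervals
# with nominal coverage 1 - alpha, where alpha = 0.02, 0.05, 0.10, 0.20, ..., 0.90
alphas <- c(0.02, 0.05, seq(0.10, 0.90, by = 0.10))
length(alphas)  # 11
```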

Evaluation measures

The submitted quantiles of a predictive distribution F define 11 central prediction intervals with nominal coverage level 1 − α where α = 0.02, 0.05, 0.10, 0.20, …, 0.90. Each of these can be evaluated using the interval score38:

$$\mathrm{IS}_{\alpha}(F,y)=(u-l)+\frac{2}{\alpha}\,(l-y)\,\chi(y<l)+\frac{2}{\alpha}\,(y-u)\,\chi(y>u).$$
(1)

Here, l and u are the lower and upper ends of the respective interval, χ is the indicator function and y is the eventually observed value. The three summands can be interpreted as a measure of sharpness and penalties for under- and overprediction, respectively. The primary evaluation measure used in this study is the weighted interval score39 (WIS), which combines the absolute error (AE) of the predictive median m and the interval scores achieved for the 11 nominal levels. The WIS is a well-known quantile-based approximation of the continuous ranked probability score38 (CRPS) and, in the case of our 11 intervals, is defined as

$$\mathrm{WIS}(F,y)=\frac{1}{11.5}\left(\frac{1}{2}\,|y-m|+\sum_{k=1}^{11}\frac{\alpha_{k}}{2}\,\mathrm{IS}_{\alpha_{k}}(F,y)\right),$$
(2)

where α1 = 0.02, α2 = 0.05, α3 = 0.10, α4 = 0.20, …, α11 = 0.90. Both the IS and WIS are proper scoring rules38, meaning that they encourage honest reporting of forecasts. The WIS is a generalisation of the absolute error to probabilistic forecasts. It reflects the distance between the predictive distribution F and the eventually observed outcome y on the natural scale of the data, meaning that smaller values are better. As secondary measures of forecast performance we considered the absolute error (AE) of point forecasts and the empirical coverage of 50% and 95% prediction intervals. In this context we note that WIS and AE are equivalent for deterministic forecasts (i.e., forecasts concentrating all probability mass on a single value). This enables a principled comparison between probabilistic and deterministic forecasts, both of which appear in the present study. Applying the absolute error implies that forecasters should report predictive medians, as pointed out in the paper describing the employed evaluation framework39.
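To make these definitions concrete, the following minimal R sketch implements equations (1) and (2) for a single forecast; the function names and the vector-based interface are ours and do not correspond to the Hub's actual evaluation code.

```r
# Interval score (Eq. 1) for a central (1 - alpha) prediction interval [l, u]
# and observation y; logical comparisons are coerced to 0/1 in the arithmetic.
interval_score <- function(l, u, y, alpha) {
  (u - l) + 2 / alpha * (l - y) * (y < l) + 2 / alpha * (y - u) * (y > u)
}

# Weighted interval score (Eq. 2) from the 23 submitted quantiles.
# `levels` and `values` hold the quantile levels and the predicted values.
weighted_interval_score <- function(levels, values, y) {
  alphas <- c(0.02, 0.05, seq(0.10, 0.90, by = 0.10))    # 11 nominal levels
  m <- values[which.min(abs(levels - 0.5))]              # predictive median
  is_k <- sapply(alphas, function(a) {
    l <- values[which.min(abs(levels - a / 2))]          # lower interval end
    u <- values[which.min(abs(levels - (1 - a / 2)))]    # upper interval end
    interval_score(l, u, y, a)
  })
  (0.5 * abs(y - m) + sum(alphas / 2 * is_k)) / 11.5
}

# Example with a hypothetical normal predictive distribution and observation
lev <- c(0.01, 0.025, seq(0.05, 0.95, by = 0.05), 0.975, 0.99)
weighted_interval_score(lev, qnorm(lev, mean = 1000, sd = 150), y = 1200)
```

If all 23 quantiles collapse to a single value m, each interval score equals 2/αk × |y − m| and the WIS reduces to the absolute error |y − m|, which is the equivalence between WIS and AE for deterministic forecasts noted above.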

In the evaluation we needed to account for the fact that forecasts can refer to either the ECDC or JHU data sets. We performed all forecast evaluations once using ECDC data and once using JHU data, with ECDC being our prespecified primary data source. For cumulative targets we shifted forecasts that refer to the other truth data source additively by the last observed difference. This is a pragmatic strategy to align forecasts with the last state of the respective time series.
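As an illustration, this alignment amounts to a simple additive shift of all predictive quantiles; the variable names below are hypothetical.

```r
# Align a cumulative forecast that refers to the JHU truth data with the ECDC
# series: shift by the last observed difference between the two cumulative counts.
shift <- tail(ecdc_cumulative, 1) - tail(jhu_cumulative, 1)
aligned_forecast_quantiles <- jhu_based_forecast_quantiles + shift
```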

A difficulty in comparative forecast evaluation lies in the handling of missing forecasts. For this case (which occurred for several teams) we prespecified that the missing score would be imputed with the worst (i.e., largest) score obtained by any other forecaster for the same target. The rationale for this was to avoid strategic omission of forecasts in weeks with low perceived predictability. In the respective summary tables any such instances are marked. All values reported are mean scores over the evaluation period, though if more than a third of the forecasts were missing we refrain from reporting.
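A minimal R sketch of this imputation rule is given below, assuming a matrix of WIS values with one row per model and one column per week for a given target; this layout is our own and not the Hub's internal one.

```r
# `wis` is a models x weeks matrix of WIS values with NA where a model did not
# submit. Missing entries are imputed with the worst, i.e. largest, score
# obtained by any other model for the same target and week.
impute_missing_scores <- function(wis) {
  worst <- apply(wis, 2, max, na.rm = TRUE)   # worst observed score per week
  for (j in seq_len(ncol(wis))) {
    wis[is.na(wis[, j]), j] <- worst[j]
  }
  wis
}
```

Mean scores per model over the evaluation period are then simple row means of the imputed matrix; as stated above, a model's mean score is not reported if more than a third of its forecasts were missing.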

Baseline forecasts

In order to put evaluation results into perspective we use three simple reference models. Note that only the first was prespecified. The two others were added later as the need for comparisons to simple, but not completely naive, approaches was recognised. More detailed descriptions are provided in Supplementary Note 2.

KIT-baseline

A naive last-observation-carried-forward approach (on the incidence scale) with identical variability for all forecast horizons (estimated from the last five observations). This is very similar to the null model used by Funk et al.17.
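A rough sketch of such a baseline is shown below. The normal approximation and the dispersion estimate based on the differences of the last five observations are our own simplifying assumptions; the exact construction used in KIT-baseline is described in Supplementary Note 2.

```r
# Last-value-carried-forward baseline (sketch): the point forecast equals the
# last observed weekly value and the same spread is used for all four horizons.
baseline_forecast <- function(y,   # vector of observed weekly incidences
                              levels = c(0.01, 0.025, seq(0.05, 0.95, 0.05), 0.975, 0.99)) {
  point <- tail(y, 1)               # last observation, carried forward
  sigma <- sd(diff(tail(y, 5)))     # assumed dispersion from the last 5 observations
  sapply(1:4, function(h) pmax(0, qnorm(levels, mean = point, sd = sigma)))
}                                   # returns a 23 x 4 matrix of predictive quantiles
```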

KIT-extrapolation baseline

A multiplicative extrapolation based on the last two observations with uncertainty bands estimated from five preceding observations.

KIT-time series baseline

An exponential smoothing model with multiplicative error terms and no seasonality as implemented in the R package forecast40 and used for COVID-19 forecasting by Petropoulos and Makridakis41.
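Such a model can be fitted, for instance, with the ets() function of the forecast package. The sketch below is only indicative, as the exact specification and the way predictive quantiles were extracted for KIT-time_series_baseline may differ (see Supplementary Note 2); weekly_cases is a placeholder for the observed weekly series.

```r
library(forecast)

# Exponential smoothing state space model with multiplicative errors ("M"),
# automatically selected trend ("Z") and no seasonality ("N").
fit <- ets(ts(weekly_cases), model = "MZN")
fc  <- forecast(fit, h = 4, level = c(50, 95))  # 1- to 4-week-ahead forecasts
```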

Contributed forecasts

During the evaluation period from October to December 2020, we assembled short-term predictions from a total of 14 forecast methods by 13 independent teams of researchers. Eight of these are run by teams collaborating directly with the Hub, based on models these researchers were either already running or set up specifically for the purpose of short-term forecasting. The remaining short-term forecasts were made available via dedicated online dashboards by their respective authors, often along with forecasts for other countries. With their permission, the Forecast Hub team assembled and integrated these forecasts. Table 3 provides an overview of all included models with brief descriptions and information on the handling of non-pharmaceutical interventions, testing strategies, age strata and the source used for truth data. More detailed verbal descriptions can be found in Supplementary Note 3. The models span a wide range of approaches, from computationally expensive agent-based simulations to human judgement forecasts. Not all models addressed all targets and forecast horizons suggested in our project; which targets were addressed by which models can be seen from Tables 1 and 2.

Ensemble forecasts

We assess the performance of three different forecast aggregation approaches:

KITCOVIDhub-median ensemble

The α-quantile of the ensemble forecast for a given quantity is given by the median of the respective α-quantiles of the member forecasts. The associated point forecast is the quantile at level α = 0.50 of the ensemble forecast (same for other ensemble approaches).

KITCOVIDhub-mean ensemble

The α-quantile of the ensemble forecast for a given quantity is given by the mean of the respective α-quantiles of the member forecasts.
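Both the median and the mean ensemble are simple quantile-wise aggregations. The sketch below assumes a matrix Q with one row per member model and one column per quantile level; this data layout is illustrative.

```r
# Quantile-wise combination of member forecasts for one target.
# Q: models x 23 matrix of predictive quantiles.
median_ensemble <- function(Q) apply(Q, 2, median)   # quantile-wise median
mean_ensemble   <- function(Q) apply(Q, 2, mean)     # quantile-wise mean
```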

KITCOVIDhub-inverse WIS ensemble

The α-quantile of the ensemble forecast is a weighted average of the α-quantiles of the member forecasts. The weights are chosen inversely to the mean WIS value obtained by the member models over six recently evaluated forecasts (last three 1-week-ahead, last two 2-week-ahead, last 3-week-ahead; missing scores are again imputed by the worst score achieved by any model for the respective target). This is done separately for incident and cumulative forecasts. The inverse-WIS ensemble is a pragmatic strategy to base weights on past performance, which is feasible with a limited amount of historical forecast/observation pairs (see ref. 42 for a similar approach).
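A sketch of the inverse-WIS weighting, using the same illustrative layout as above and assuming that missing scores have already been imputed as described earlier:

```r
# `recent_wis`: models x 6 matrix holding, for each member model, the WIS of the
# six most recently evaluated forecasts (three 1-week-ahead, two 2-week-ahead,
# one 3-week-ahead), with missing scores imputed beforehand.
inverse_wis_weights <- function(recent_wis) {
  w <- 1 / rowMeans(recent_wis)   # weight inversely proportional to mean WIS
  w / sum(w)                      # normalise weights to sum to one
}

# Weighted quantile-wise combination of the member forecasts in Q (models x 23).
inverse_wis_ensemble <- function(Q, recent_wis) {
  as.vector(inverse_wis_weights(recent_wis) %*% Q)
}
```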

Only models providing complete probabilistic forecasts with 23 quantiles for all four forecast horizons were included in the ensemble for a given target. It was not required that forecasts be submitted for both cumulative and incident targets, so that ensembles for incident and cumulative cases were not necessarily based on exactly the same set of models. The Forecast Hub Team reserved the right to screen and exclude member models in case of implausibilities. Decisions on inclusion were taken simultaneously for all three ensemble versions and were documented in the Forecast Hub platform (file decisions_and_revisions.txt in the main folder of the repository). The main reasons for the exclusion of forecasts from the ensemble were forecasts in an implausible order of magnitude or forecasts with vanishingly small or excessive uncertainty. As it showed comparable performance to submitted forecasts, the KIT-time_series_baseline model was included in the ensemble forecasts in most weeks.

Preliminary results from the US COVID-19 Forecast Hub indicate better forecast performance of the median compared to the mean ensemble43, and the median ensemble has served as the operational ensemble there since 28 July 2020. To date, trained ensembles have yielded only limited, if any, benefits30. We therefore prespecified the median ensemble as our main ensemble approach. Note that in other works19,44, ensembles have been constructed by combining probability densities rather than quantiles. These two approaches behave somewhat differently, but no general statement can be made as to which one yields better performance45. As member forecasts in our setting were reported in a quantile format, we resort to quantile-based methods for aggregation.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.