Background
Into the multiverse of prediction models
Main text
Examining the multiverse using bootstrapping and instability plots
Using the model development dataset of \(n\) participants from the chosen target population, we recommend the following process:
• Step 1: Use the developed model to make predictions (\({\widehat{p}}_{i}\)) for each individual participant (\(i=1\) to \(n\)) in the development dataset
• Step 2: Generate a bootstrap sample with replacement, of size \(n\)
• Step 3: Develop a bootstrap prediction model in the bootstrap sample, replicating exactly (or as far as practically possible) the same model development approach and set of candidate predictors as used originally
• Step 4: Use the bootstrap model developed in step 3 to make predictions for each individual (\(i\)) in the original dataset. We refer to these predictions as \({\widehat{p}}_{bi}\), where \(b\) indicates which bootstrap sample the model was generated in (\(b=1\) to \(B\))
• Step 5: Repeat steps 2 to 4 a further \(B-1\) times, so that \(B\) bootstrap models are developed in total; we suggest \(B\) is at least 200
• Step 6: Store all the predictions from the \(B\) iterations of steps 2 to 4 together in a single dataset, containing for each individual a prediction (\({\widehat{p}}_{i}\)) from the original model and \(B\) predictions (\({\widehat{p}}_{1i}, {\widehat{p}}_{2i}, \dots, {\widehat{p}}_{Bi}\)) from the bootstrap models
• Step 7: Summarise the instability in the predictions. In particular, quantify the mean absolute prediction ‘error’ (MAPE) for each individual, summarise this across individuals, and display a prediction instability plot (a scatter of the \(B\) predicted values for each individual against their original predicted value). Other instability plots (e.g. for classification, clinical utility) and measures may also be useful, as shown elsewhere [12].
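The seven steps above can be sketched in code. The example below is a minimal illustration under stated assumptions, not the implementation used for the applied examples. It assumes a binary outcome and takes the model development approach to be an unpenalised logistic regression with all predictors forced in, fitted by Newton-Raphson. The simulated data and the function names (`fit_logistic`, `predict_risk`, `bootstrap_instability`) are hypothetical. In practice, step 3 should repeat the full original development strategy, including any variable selection or tuning. Each individual's MAPE is taken here as \(\frac{1}{B}\sum_{b=1}^{B}\left|{\widehat{p}}_{bi}-{\widehat{p}}_{i}\right|\).

```python
import numpy as np


def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Fit a logistic regression by Newton-Raphson.

    Hypothetical stand-in for whatever model development approach was used
    originally; the tiny ridge term only keeps the Hessian invertible.
    """
    Xd = np.column_stack([np.ones(len(X)), X])  # add an intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        w = p * (1.0 - p)
        hessian = Xd.T @ (Xd * w[:, None]) + ridge * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(hessian, Xd.T @ (y - p))
    return beta


def predict_risk(beta, X):
    """Predicted probabilities from fitted coefficients."""
    Xd = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xd @ beta))


def bootstrap_instability(X, y, B=200, seed=0):
    """Steps 1 to 7: original predictions, B bootstrap predictions, MAPE."""
    rng = np.random.default_rng(seed)
    n = len(y)
    # Step 1: predictions from the model developed in the full dataset
    p_orig = predict_risk(fit_logistic(X, y), X)
    # Step 6: one row per bootstrap model, one column per individual
    p_boot = np.empty((B, n))
    for b in range(B):  # Step 5: B bootstrap models in total
        idx = rng.integers(0, n, size=n)        # Step 2: resample with replacement
        beta_b = fit_logistic(X[idx], y[idx])   # Step 3: redevelop the model
        p_boot[b] = predict_risk(beta_b, X)     # Step 4: predict in original data
    # Step 7: mean absolute difference from the original prediction, per individual
    mape_i = np.abs(p_boot - p_orig).mean(axis=0)
    return p_orig, p_boot, mape_i


# Simulated development dataset (assumed values, for illustration only)
rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 2))
true_lp = -1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_lp)))

p_orig, p_boot, mape_i = bootstrap_instability(X, y, B=200)
print("Average MAPE across individuals:", mape_i.mean())
print("Largest individual MAPE:", mape_i.max())
```

A prediction instability plot can then be drawn by scattering each row of `p_boot` against `p_orig`, for example with a plotting library such as matplotlib. Wider vertical spread around the diagonal indicates greater instability in individual predictions.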
Applied examples and the impact of sample size on instability
Why does instability in individual predictions matter in healthcare?
Instability of individual predictions and impact on discrimination
Instability of individual predictions and impact on calibration
Do instability concerns apply to AI or machine learning methods?
Addressing instability in the multiverse by targeting larger sample sizes
Instability checks are important for model fairness and model comparisons
Are there limitations with the bootstrap approach?
Might the multiverse be even more diverse?
Conclusions
“The Multiverse is a concept about which we know frighteningly little” Dr. Strange (Spider-Man: No Way Home)