While the different imputation approaches may not offer distinct benefits in terms of reducing the RMSE and MAE in some circumstances, there may still be situations where imputation at the item or subscale level is advantageous. This is the case where the planned analysis covers not only the composite scores, but also the subscales (where applicable) or even the individual PROM items. If feasible, imputation at the item or subscale level ensures that a common imputation dataset can be used for all analyses related to the relevant PROM.
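As a minimal sketch of why item level imputation allows one imputation dataset to serve all analyses, the example below derives composite and subscale scores from the same imputed items. The item values, the function names and the split of the 12 items into two subscales are hypothetical and are not taken from the study data or scoring manuals.

```python
# Illustrative only: item values and the subscale split are hypothetical.
from statistics import mean


def composite_score(items):
    """Sum of item scores (e.g. 12 items scored 0-4 give a 0-48 composite)."""
    return sum(items)


def subscale_scores(items, subscales):
    """Sum the items of each subscale; `subscales` maps name -> item indices."""
    return {name: sum(items[i] for i in idx) for name, idx in subscales.items()}


# Two imputed versions (m = 2) of one participant's 12 item responses.
imputed_items = [
    [4, 3, 3, 2, 4, 3, 2, 3, 4, 3, 2, 3],
    [4, 3, 2, 2, 4, 3, 3, 3, 4, 3, 2, 4],
]

# Hypothetical grouping of the items into two subscales.
subscales = {"pain": [0, 1, 2, 3, 4, 5, 6], "function": [7, 8, 9, 10, 11]}

# Both score types are derived from the same imputed items.
composites = [composite_score(items) for items in imputed_items]
per_subscale = [subscale_scores(items, subscales) for items in imputed_items]

# Simple averages across imputations, shown only to illustrate consistency;
# in practice each imputed dataset is analysed separately and results pooled.
mean_composite = mean(composites)
mean_pain = mean(s["pain"] for s in per_subscale)
```

Because the subscale scores are computed from the same imputed items as the composite, they always add up to the composite within each imputed dataset, which would not be guaranteed if the two levels were imputed separately.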
Strengths and limitations
This research contributes to the literature by using new datasets and patient populations to validate previous work on the effect of MI at the item and composite score level on the RMSE and MAE observed in the composite scores [17] and on treatment effects [18] in PROMs analysis. In addition, previous research has been extended to further questionnaires and missing data scenarios, thus offering additional guidance to researchers faced with missing PROMs data in RCTs. This study covers a range of sample sizes (100 to approximately 1000) and rates of missing data (5, 10, 20 and 40%), which are representative of current figures observed in published RCTs [14–16, 37, 38]. While RCTs with smaller sample sizes are also common, these are often pilot and feasibility studies which focus on endpoints such as recruitment and completeness of endpoints, or are underpowered for the type of analyses used in this simulation study.
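For readers less familiar with the accuracy measures referred to here, the following sketch computes a root mean square error and a mean absolute error between true and imputed composite scores. The function names and the four toy scores are illustrative assumptions, not the study's code or data.

```python
from math import sqrt


def rmse(true_scores, imputed_scores):
    """Root mean square error between true and imputed scores."""
    squared = [(t - i) ** 2 for t, i in zip(true_scores, imputed_scores)]
    return sqrt(sum(squared) / len(squared))


def mae(true_scores, imputed_scores):
    """Mean absolute error between true and imputed scores."""
    absolute = [abs(t - i) for t, i in zip(true_scores, imputed_scores)]
    return sum(absolute) / len(absolute)


# Hypothetical composite scores for four participants whose data were deleted
# and then imputed within one simulated dataset.
true_scores = [30, 42, 18, 25]
imputed_scores = [33, 38, 18, 30]

print(rmse(true_scores, imputed_scores))
print(mae(true_scores, imputed_scores))
```

The RMSE penalises large individual errors more heavily than the mean absolute error, so the two measures can rank imputation approaches differently when a few imputed scores are far from the truth.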
Although every effort was made to conduct this simulation study as thoroughly and completely as possible, it is not without limitations. The scenarios considered are limited to specific sample sizes, proportions of missing data and missing data patterns. However, we believe that sample sizes between 100 and around 1000 participants, and missing data levels between 5 and 40%, are representative of the vast majority of RCTs. Future work on larger sample sizes, extending the generalisability to larger-scale epidemiological research, is needed. Such studies often collect a larger pool of patient demographics, the inclusion of which may affect the performance of the imputation models. Similarly, the missing data patterns used were based on those observed in the KAT trial. We believe that these patterns are realistic and representative for the PROMs used, and we included variations in the amount of unit-nonresponse for the OKS simulations.
This simulation work is restricted to the KAT dataset. Additional validation work in further datasets, other disease areas and other PROMs may be useful to explore whether the recommendations provided here still hold when the different approaches are applied to datasets with different correlations between baseline and outcome data, distributions of outcomes, treatment mechanisms and MAR patterns. However, the fact that the findings of Simons et al. [17] and Eekhout et al. [18] could be replicated suggests that the findings are generalisable. The main body of this simulation work used the missing data pattern observed in the KAT study at the 5-year follow-up. Additional missing data patterns (i.e. unit-nonresponse and increased levels of item missingness) were only considered for the OKS, in order to supplement the findings by Simons et al. [17] on the effect of increased proportions of unit-nonresponse in the EQ-5D-3L.
Findings are limited to PROMs with up to 12 items. Therefore, uncertainty still exists as to the maximum number of items within a PROM for which item imputation would still be considered feasible, which is likely to be related to both the construct of the PROM, as well as the sample size. However, we believe that larger datasets are needed to ensure feasibility of item imputation for PROMs with more than 12 items, which are therefore not within the remit of this research.
In this simulation work, the relative performance of the imputation approaches appeared to be related to the outcome of interest, with smaller differences for the estimation of the treatment effect compared to the estimation of the composite scores across all scenarios. Further research is needed to establish if this is an artefact of these parameters being estimated on different scales (i.e. the OKS ranges from 0 to 48, while the treatment effects observed in the trial were nonsignificant and close to zero), or whether this is a more generalisable finding.
This study only considers analysis scenarios with a single follow-up time point. This approach was chosen because the primary analyses of many trials focus on the primary endpoint at a specific follow-up time point, rather than on analysis approaches that take the longitudinal data into account. Imputation of PROMs item level data at additional time points was ruled out as infeasible due to the low convergence rates already observed in the current scenarios. While including PROMs follow-up data from intermediate time points in the item level imputation model may have improved imputations at the 5-year follow-up, in practice these data are often less well collected than the outcome data at the primary follow-up time point. Including them would have added complexity to the imputation models, and they were therefore not included in this study. Researchers should examine on a case-by-case basis whether sufficient intermediate or later follow-up data are available to benefit the imputation of missing outcome data.
Missing not at random (MNAR) mechanisms and misspecification of MI models were not considered in this paper, although Simons et al. [17] reported benefits of MI at the item level over MI at the score level for the latter scenario. However, it was felt that MI models could be misspecified in a number of ways, and that the results from selected misspecifications may not be generalisable, because some variables are much more predictive of the missing data than others. The same applies to MNAR scenarios, which could be considered as misspecified MI models in that they are unable to account for important factors that are predictive of data being missing as well as of the missing observations themselves. We recommend that MNAR mechanisms are addressed as part of a sensitivity analysis [39–41].
Some of the non-convergence rates observed in the results are very high, and could have been improved by simplifying the MI models. However, MI models were constructed using the full base case datasets, and were then applied to all sample size scenarios to allow a direct comparison of performance between the different scenarios. In reality, MI models should be built for the dataset under consideration, with their complexity adjusted to the type and quantity of data available. They should include relevant variables that are good predictors of data being missing and/or of the variables to be imputed, and the functional form of the imputation model should be appropriate for the data. Here, the correlations between outcomes and the covariates used in the imputation models were low to moderate. While this is representative of RCTs in general, the inclusion of more highly correlated variables will improve imputation results. Researchers should also run all required imputations within the same model. The approach chosen in this simulation study, whereby item level imputations were run one-by-one to exclude occasional instances of non-convergence, was chosen as a compromise to increase convergence rates within these simulations.
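One possible reading of the one-by-one approach is sketched below: each imputation is attempted separately, failed runs are discarded, and the proportion of failures is recorded. The function `impute_once`, the use of `RuntimeError` to signal non-convergence and all parameter names are hypothetical stand-ins, not the software or settings used in this study.

```python
import random


def impute_once(rng, fail_prob):
    """Hypothetical stand-in for a single item level imputation run.

    Raises RuntimeError to mimic an imputation model that fails to converge.
    """
    if rng.random() < fail_prob:
        raise RuntimeError("imputation model did not converge")
    return {"imputed": True}  # placeholder for one imputed dataset


def run_imputations(n_required, fail_prob, seed=1, max_attempts=1000):
    """Run imputations one at a time, keeping only those that converge.

    Returns the retained imputations and the observed non-convergence rate.
    """
    rng = random.Random(seed)
    kept, failures, attempts = [], 0, 0
    while len(kept) < n_required and attempts < max_attempts:
        attempts += 1
        try:
            kept.append(impute_once(rng, fail_prob))
        except RuntimeError:
            failures += 1  # discard the failed run and try again
    rate = (failures / attempts) if attempts else 0.0
    return kept, rate


kept, non_convergence_rate = run_imputations(n_required=20, fail_prob=0.3)
print(len(kept), round(non_convergence_rate, 2))
```

Recording the non-convergence rate alongside the retained imputations makes it possible to flag scenarios in which the retained results may not be representative of all simulated datasets.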
The high failure rates in some of the simulations may have introduced a systematic selection bias into the results of the relevant simulation scenarios, because item MI is more likely to fail in datasets with certain characteristics. MI at the item level is considered less likely to be feasible in these scenarios, which were typically those with smaller sample sizes and higher missing data rates. More likely, however, is that for the smaller sample sizes the ordinal logit models used in the item regression fit too poorly to produce reliable predictions to inform the imputations. For this reason, imputation at the composite score or subscale level is recommended for these scenarios. Simulations with higher convergence rates are not thought to be affected.
Different numbers of imputations were used for the imputations at the item level, mainly for practical reasons including the time taken to perform large numbers of imputations at this level, and the numbers were therefore inconsistent across some of the scenarios. The number of imputations performed was still in line with available guidance, and is therefore expected to produce robust results. However, it is possible that the differences in the number of imputations have added some variation to the study results.
Finally, simulations were restricted to 1000 iterations, again mainly for practical reasons including time taken to perform large numbers of the imputations at the item level. Additional simulations (up to 5000) were run for isolated scenarios, and results were consistent with those presented in this paper.
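To give a rough sense of what moving from 1000 to 5000 iterations buys, the sketch below uses the standard Monte Carlo standard error of a simulation mean, which is the standard deviation of the per-iteration estimates divided by the square root of the number of iterations. The standard deviation of 2.0 is a hypothetical value chosen for illustration and is not a result from this study.

```python
from math import sqrt


def monte_carlo_se(sd_estimates, n_iterations):
    """Monte Carlo standard error of a mean over simulation iterations."""
    return sd_estimates / sqrt(n_iterations)


sd_estimates = 2.0  # hypothetical SD of the per-iteration estimates

mcse_1000 = monte_carlo_se(sd_estimates, 1000)
mcse_5000 = monte_carlo_se(sd_estimates, 5000)

# Five times as many iterations shrinks the error by a factor of sqrt(5).
print(mcse_1000, mcse_5000, mcse_1000 / mcse_5000)
```

Under this formula, a fivefold increase in iterations reduces the Monte Carlo error by a factor of about 2.2, regardless of the value assumed for the standard deviation.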