Journal of Vascular Surgery

Volume 58, Issue 5, November 2013, Pages 1353-1359.e6

Clinical research study
Comparative methods for handling missing data in large databases

https://doi.org/10.1016/j.jvs.2013.05.008
Under an Elsevier user license
open archive

Objective

Analysis of complex survey databases is an important tool for health services researchers. Missing data elements are challenging because the reasons for “missingness” are multifactorial, especially for categorical variables such as race. We simulated missing data for race and analyzed the bias resulting from five missing data methods when predicting major amputation in patients with critical limb ischemia (CLI).

Methods

Patient discharges with fully observed data containing lower extremity revascularization or major amputation and CLI were selected from the 2003 to 2007 Nationwide Inpatient Sample, a complex survey database (weighted n = 684,057). Considering several random missing data schemes, we compared five missing data methods: complete case analysis, replacement with observed frequencies, missing indicator variable, multiple imputation, and reweighted estimating equations. We created 100 simulated data sets, with 5%, 15%, or 30% of subjects' race drawn to be missing from the full data set. Bias was estimated by comparing the estimated regression coefficients averaged over 100 simulated data sets (βmiss) from each method vs estimates from the fully observed data set (βfull), with relative bias calculated as [(βfull – βmiss) / βfull] × 100%.
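The simulation logic described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the toy race categories, and the coefficient values are hypothetical, and the masking function implements only one possible scheme (values removed completely at random), whereas the study considered several. The relative bias function averages the coefficient estimates over the simulated data sets and compares the average with the full-data estimate, as a percentage of the full-data estimate.

```python
import random
from statistics import mean


def relative_bias(beta_full, beta_miss_estimates):
    """Relative bias (%) of the averaged missing-data estimate vs the full-data estimate."""
    # Average the coefficient over all simulated data sets (beta_miss).
    beta_miss = mean(beta_miss_estimates)
    return (beta_full - beta_miss) / beta_full * 100.0


def mask_completely_at_random(values, fraction, rng):
    """Return a copy of `values` with `fraction` of entries set to None.

    Assumption: missingness is completely at random; this is only one
    of many possible missing data schemes.
    """
    n_missing = round(len(values) * fraction)
    missing_idx = set(rng.sample(range(len(values)), n_missing))
    return [None if i in missing_idx else v for i, v in enumerate(values)]


# Hypothetical toy data, not taken from the study.
rng = random.Random(0)
race = ["white", "black", "hispanic", "other"] * 25  # 100 subjects
masked = mask_completely_at_random(race, 0.15, rng)
n_missing = sum(v is None for v in masked)

# Hypothetical coefficients: one full-data estimate and three simulated estimates.
bias_pct = relative_bias(0.50, [0.45, 0.40, 0.50])
```

In a fuller version, each masked data set would be passed to one of the five missing data methods, the model would be refit, and the resulting coefficients would be collected before calling `relative_bias`. Under this sign convention, a positive value means the missing data method pulls the coefficient below the full-data estimate.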

Results

Our results demonstrate that reweighted estimating equations produce the least biased and the missing indicator variable produces the most biased coefficients. Complete case analysis, replacement with observed frequencies, and multiple imputation resulted in moderate bias. Sensitivity analysis demonstrated the optimal method choice depends on the quantity and type of missing data encountered.

Conclusions

Missing data are an important analytic topic in research with large databases. The commonly used missing indicator variable method introduces severe bias and should be used with caution. We present empiric evidence to guide method selection for handling missing data.

This work was supported by a National Institutes of Health K23 Research Career Development Award (HL-084386) to L.L.N., a National Research Service Award T32 Institutional Research Training Grant (HL-007734) to A.J.H., and a National Cancer Institute Award (NCI CA-60679) to S.L.

Author conflict of interest: none.

Additional material for this article may be found online at www.jvascsurg.org.

The editors and reviewers of this article have no relevant financial relationships to disclose per the JVS policy that requires reviewers to decline review of any manuscript for which they may have a conflict of interest.