Background
Accuracy of logit coefficients in small samples
Separation
Outline of the paper
Methods
General
Simulation procedures
R software version 3.1.1 [22]. For each data set, sampling was continued until the prespecified criteria for sample size and the number of events were met, keeping the first events and non-events generated up to the required number of each. This procedure ensured a fixed sample size (N) and number of events (EPV) in each data set. This approach, which is equivalent to the approach used by Vittinghoff and McCulloch [13], takes advantage of a property of the logistic model: only the intercept is affected by this sampling procedure.

glm function in the stats library (version: 3.1.1) and the logistf function in the logistf library (version: 1.21), respectively. To identify separation in simulated data sets, the maximum likelihood standard errors of the parameters were monitored through a re-estimation process [23]. This procedure is explained in detail in the Appendix. Unless otherwise specified, the default software criteria for convergence were used, the regression coefficient accuracy measures were calculated from converged simulation results only, and maximum likelihood estimates for data sets that exhibited separation were excluded from the calculation of simulation results.

Part I: Accuracy of logit coefficients
Factors | Study Ia | Study Ib | Study Ic | Study Id |
---|---|---|---|---|
Sample size | | | | |
EPV (with steps of) | 15 to 150 (5) | 15 to 150 (5) | 6 to 30 (2) | 6 to 30 (2) |
Outcome prevalence | 1/2 | 1/2 | 1/2, 1/3, 1/4, 1/5, 1/10 | 1/4 |
Range sample size | 30 to 300 | 60 to 1200 | 24 to 600 | 60 to 300 |
Effect size | | | | |
Value of \(e^{\beta_{1}}\) | 1/4, 1/2, 1, 2, 4 | 2, 4 | 2 | 2 |
Value of \(e^{\beta_{j}}, j > 1\) | Not applicable | \(\beta_{1} = \ldots = \beta_{P}\) | 2 | 2 |
Covariates | | | | |
Number (P) | 1 | 2, 3, 4 | 2 | 2 |
Distribution | (Multivariate) standard normal | (Multivariate) standard normal | (Multivariate) standard normal | (Multivariate) standard normal |
Correlation | Not applicable | 0 | 0 | .1, .15, .2, .25 |
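
As a rough illustration of the design factors and the sampling rule described under Simulation procedures, the sketch below derives the number of events and non-events from EPV, the number of covariates P and the outcome prevalence, and then draws candidates until both quotas are filled, discarding surplus events or non-events. It assumes events = EPV × P and N = events / prevalence, which reproduces the sample size ranges listed for Studies Ia to Ic. The function names, the intercept value and the restriction to uncorrelated standard normal covariates are assumptions made for illustration, and the sketch is written in Python rather than the R code used for the study.

```python
import math
import random
from fractions import Fraction


def target_counts(epv, n_covariates, prevalence):
    """Return (n_events, n_nonevents) implied by EPV, P and outcome prevalence.

    Assumption: events = EPV * P and N = events / prevalence.
    """
    n_events = epv * n_covariates
    n_total = n_events / Fraction(prevalence)
    if n_total.denominator != 1:
        raise ValueError("EPV, P and prevalence do not give an integer N")
    return n_events, int(n_total) - n_events


def simulate_dataset(epv, betas, prevalence, intercept=0.0, seed=None):
    """Draw (x, y) pairs until the event and non-event quotas are both met.

    Only the first events and non-events up to the required number are kept.
    Covariates are independent standard normal (hypothetical simplification).
    """
    rng = random.Random(seed)
    n_events, n_nonevents = target_counts(epv, len(betas), prevalence)
    events, nonevents = [], []
    while len(events) < n_events or len(nonevents) < n_nonevents:
        x = [rng.gauss(0.0, 1.0) for _ in betas]
        eta = intercept + sum(b * xi for b, xi in zip(betas, x))
        p = 1.0 / (1.0 + math.exp(-eta))
        y = 1 if rng.random() < p else 0
        if y == 1 and len(events) < n_events:
            events.append((x, y))
        elif y == 0 and len(nonevents) < n_nonevents:
            nonevents.append((x, y))
    return events + nonevents


# Example: EPV = 6, two covariates with odds ratio 2, prevalence 1/4.
data = simulate_dataset(6, [math.log(2), math.log(2)], Fraction(1, 4), seed=1)
```

Because the quotas fix the outcome distribution, the intercept chosen for generation only changes how many candidates are discarded. Note that for the Study Id settings (EPV 6 to 30, P = 2, prevalence 1/4) this arithmetic gives 48 to 240 rather than the tabulated 60 to 300, so the sketch should not be read as the exact design used.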
Part II: Detection and handling of separated data sets
IIa. Binary single covariate
IIb. Single simulation scenario, continuous covariate
glm function default), tol: 1e-6, max-iter: 25 (Type I), tol: 1e-10, max-iter: 25 (Type II), tol: 1e-10, max-iter: 50 (Type III). Univariate covariate data were generated from a standard normal distribution, and the ratio of events to non-events was kept constant at 1:1. EPV was fixed at 4 and β1 = log(4).

Results
Part I: Accuracy of logit coefficients
Study | Study Ia* and Ib | | | | | | Study Ic and Id | | | | | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
EPV | 15 to 30 | | 35 to 50 | | 55 to 150 | | 6 to 10 | | 12 to 18 | | 20 to 30 | |
Estimator | \(\beta_{1}^{ML}\) | \(\beta_{1}^{F}\) | \(\beta_{1}^{ML}\) | \(\beta_{1}^{F}\) | \(\beta_{1}^{ML}\) | \(\beta_{1}^{F}\) | \(\beta_{1}^{ML}\) | \(\beta_{1}^{F}\) | \(\beta_{1}^{ML}\) | \(\beta_{1}^{F}\) | \(\beta_{1}^{ML}\) | \(\beta_{1}^{F}\) |
Bias | | | | | | | | | | | | |
Average bias | 0.084 | 0.002 | 0.038 | 0.001 | 0.016 | 0.000 | 0.069 | 0.002 | 0.033 | 0.000 | 0.020 | 0.000 |
max | 0.261 | 0.016 | 0.091 | 0.005 | 0.056 | 0.006 | 0.217 | 0.021 | 0.075 | 0.011 | 0.046 | 0.005 |
min | 0.025 | -0.004 | 0.013 | -0.002 | 0.004 | -0.005 | 0.023 | -0.005 | 0.016 | -0.003 | 0.009 | -0.003 |
Average relative bias (%) | 7.8 | 0.1 | 3.6 | 0.1 | 1.5 | 0.0 | 8.4 | 0.4 | 4.8 | 0 | 2.9 | 0 |
max | 18.8 | 1.2 | 6.6 | 0.5 | 4.0 | 0.5 | 31.2 | 3.0 | 10.8 | 1.6 | 6.5 | 0.7 |
min | 3.5 | -0.5 | 1.9 | -0.3 | 0.5 | -0.7 | 3.3 | -0.7 | 2.3 | -0.5 | 1.3 | -0.0 |
>+10% relative bias (%) | 18.8 | 0 | 0 | 0 | 0 | 0 | 37.5 | 0 | 3 | 0 | 0 | 0 |
Coverage 90% CI | | | | | | | | | | | | |
Average coverage (%) | 90.4 | 90.1 | 90.2 | 90.2 | 90.1 | 90.0 | 90.4 | 90.3 | 90.2 | 90.2 | 90.1 | 90.2 |
max | 92.9 | 90.8 | 91.1 | 90.7 | 91.0 | 90.7 | 92.1 | 91.2 | 90.8 | 90.6 | 90.9 | 90.8 |
min | 89.1 | 89.4 | 89.3 | 89.6 | 89.4 | 89.2 | 89.6 | 89.6 | 89.7 | 89.6 | 89.3 | 89.6 |
> ± 1% nominal (%) | 15.6 | 0 | 3.1 | 0 | 0.6 | 0 | 10 | 2.5 | 0 | 0 | 0 | 0 |
Average width | 1.102 | 1.059 | 0.752 | 0.738 | 0.487 | 0.483 | 1.183 | 1.133 | 0.828 | 0.811 | 0.653 | 0.646 |
Mean Square Error | | | | | | | | | | | | |
Average MSE | 0.160 | 0.118 | 0.063 | 0.055 | 0.025 | 0.024 | 0.169 | 0.125 | 0.070 | 0.062 | 0.042 | 0.039 |
Separated data sets | | | | | | | | | | | | |
Total (%) | 0.006 | | 0 | | 0 | | 0.001 | | 0 | | 0 | |
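
The accuracy measures reported in the table above (bias, relative bias, coverage and width of the 90% confidence interval, and mean square error) can be sketched for a single simulation condition as follows. This is a minimal illustration in Python, not the code used for the study: the function name is hypothetical, and the interval is assumed to be a Wald interval (estimate ± z × standard error), which may differ from the interval construction actually used, in particular for the Firth estimator.

```python
from statistics import NormalDist


def accuracy_measures(estimates, std_errors, true_beta, level=0.90):
    """Summarise one simulation condition from per-data-set estimates.

    Assumption: Wald confidence intervals, estimate +/- z * SE.
    """
    n = len(estimates)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.645 for 90%
    bias = sum(estimates) / n - true_beta
    # Relative bias is undefined when the true coefficient is zero.
    relative_bias = 100.0 * bias / true_beta if true_beta != 0 else None
    mse = sum((b - true_beta) ** 2 for b in estimates) / n
    covered = sum(
        1 for b, se in zip(estimates, std_errors)
        if abs(b - true_beta) <= z * se
    )
    width = sum(2.0 * z * se for se in std_errors) / n
    return {
        "bias": bias,
        "relative_bias_pct": relative_bias,
        "mse": mse,
        "coverage_pct": 100.0 * covered / n,
        "mean_width": width,
    }


# Toy input: three estimates of a true coefficient of 0.7.
summary = accuracy_measures([0.5, 0.7, 0.9], [0.2, 0.2, 0.2], 0.7)
```

The tabulated averages, maxima and minima would then be taken over the per-condition summaries within each EPV range.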
Part II: Detection and handling of separated data sets
EPV | 15 to 30 | | 35 to 50 | | 55 to 150 | |
---|---|---|---|---|---|---|
Separated data removed | Yes | No | Yes | No | Yes | No |
Bias | | | | | | |
Average bias | -0.097 | 2.255 | 0.083 | 0.161 | 0.051 | 0.053 |
max | 0.091 | 7.074 | 0.127 | 0.439 | 0.084 | 0.096 |
min | -0.556 | 0.234 | 0.050 | 0.056 | 0.048 | 0.022 |
Average relative bias (%) | -0.087 | 2.110 | 0.079 | 0.145 | 0.048 | 0.049 |
max | 0.091 | 5.103 | 0.095 | 0.317 | 0.061 | 0.069 |
min | -0.401 | 0.338 | 0.069 | 0.081 | 0.032 | 0.032 |
Coverage 90% CI | | | | | | |
Average coverage (%) | 92.7 | 93.4 | 89.1 | 89.1 | 90.4 | 90.4 |
max | 98.3 | 98.8 | 90.6 | 90.6 | 91.8 | 91.8 |
min | 89.7 | 89.8 | 87.9 | 87.9 | 89.2 | 89.2 |
> ± 1% nominal (%) | 75 | 75 | 50 | 37.5 | 25 | 25 |
Average width | 4.087 | 4437.2 | 2.656 | 49.2 | 2.005 | 2.645 |
Mean Square Error | | | | | | |
Average MSE | 1.251 | 64.571 | 0.709 | 2.243 | 0.397 | 0.422 |
Separated data sets | | | | | | |
Total (%) | 13.2 | | 4.2 | | 0.006 | |
Estimator | \(\beta_{1}^{F}\) | \(\beta_{1}^{ML}\) | \(\beta_{1}^{ML}\) | \(\beta_{1}^{ML}\) | \(\beta_{1}^{ML}\) | \(\beta_{1}^{ML}\) | \(\beta_{1}^{ML}\) |
---|---|---|---|---|---|---|---|
Separation detection | NA | Tracing^b | Estimate^c | None | None | None | None |
Convergence criterion^a | Default | Default | Default | Default | Type I | Type II | Type III |
Data sets removed (%) | 0 | 8.06 | 16.64 | 5.12 | 0.34 | 6.29 | 0.09 |
Bias | 0.012 | 0.569 | 0.186 | 1.672 | 17.5 | 0.856 | 41.3 |
Coverage 90% CI | 0.919 | 0.949 | 0.937 | 0.944 | 0.947 | 0.944 | 0.947 |
Mean width 90% CI | 4.32 | 4.50 | 3.64 | 5018 | 13620 | 6.03 | 1135784 |
MSE | 1.080 | 2.681 | 0.904 | 71.563 | 11532 | 319 | 173726 |
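
Why the convergence criterion matters when separated data sets are retained can be illustrated with a small sketch. Under complete separation the maximum likelihood estimate does not exist, so the value an iterative fit reports depends on when the iterations are stopped. The Python code below fits a one-covariate logistic model by Newton-Raphson and stops on the relative change in deviance, a rule modelled on the one in R's glm.fit. It is a simplified, hypothetical illustration (no intercept, toy data, invented function names) and not the fitting code used for the study.

```python
import math


def _prob_parts(eta):
    """Return (p, 1 - p) for the logistic function, computed stably."""
    w = math.exp(-abs(eta))
    big, small = 1.0 / (1.0 + w), w / (1.0 + w)
    return (big, small) if eta >= 0 else (small, big)


def _deviance(beta, x, y):
    """Minus twice the log-likelihood of the no-intercept logistic model."""
    total = 0.0
    for xi, yi in zip(x, y):
        s = beta * xi if yi == 1 else -beta * xi
        # log P(Y = yi) = -log(1 + exp(-s)), written to avoid overflow.
        total += -math.log1p(math.exp(-s)) if s >= 0 else s - math.log1p(math.exp(s))
    return -2.0 * total


def fit_logit(x, y, tol, max_iter):
    """Newton-Raphson fit; returns (beta, converged, iterations used)."""
    beta = 0.0
    dev_old = _deviance(beta, x, y)
    for iteration in range(1, max_iter + 1):
        score = info = 0.0
        for xi, yi in zip(x, y):
            p, q = _prob_parts(beta * xi)
            score += xi * (q if yi == 1 else -p)  # x * (y - p)
            info += xi * xi * p * q
        beta += score / info
        dev = _deviance(beta, x, y)
        if abs(dev - dev_old) / (abs(dev) + 0.1) < tol:
            return beta, True, iteration
        dev_old = dev
    return beta, False, max_iter


# Completely separated toy data: y = 1 exactly when x > 0.
x_sep = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
y_sep = [0, 0, 0, 1, 1, 1]
for label, tol, max_iter in [("Type I", 1e-6, 25), ("Type II", 1e-10, 25),
                             ("Type III", 1e-10, 50)]:
    print(label, fit_logit(x_sep, y_sep, tol, max_iter))
```

On the separated toy data the reported coefficient keeps growing with stricter tolerances or more iterations, whereas on data with overlap the three settings agree closely. The fit can still be flagged as converged under separation, because the deviance approaches zero and its relative change eventually falls below any tolerance.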
Discussion
Drivers of the accuracy of logit coefficients
SAS, Stata and R).

The impact of separated data sets on simulation results
Reasons for differences between EPV simulation studies
Conclusion
Appendix
brglm package (Version 0.5-9) for R by Ioannis Kosmidis. Separation for a parameter was said to occur if the variance of the scaled standard errors over refits (scaled such that the standard errors at the first iteration equal 1) was larger than 20. This cut-off value was chosen based on a small pilot study (results not shown).
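
The decision rule above can be sketched as follows, assuming the standard errors of each parameter have already been collected from a series of refits with an increasing number of iterations. The function name, the input layout and the use of the sample variance are assumptions for illustration; the sketch is in Python and does not call the brglm package.

```python
from statistics import variance


def flag_separation(se_by_parameter, cutoff=20.0):
    """Flag parameters whose scaled standard errors vary strongly over refits.

    se_by_parameter maps a parameter name to its standard errors over
    successive refits (hypothetical layout). Each series is divided by its
    first value, so that the first scaled standard error equals 1.
    """
    flags = {}
    for name, ses in se_by_parameter.items():
        scaled = [se / ses[0] for se in ses]
        # Assumption: sample variance (n - 1 denominator).
        flags[name] = variance(scaled) > cutoff
    return flags


# Toy input: a stable intercept and a slope whose standard error diverges.
refit_ses = {
    "intercept": [0.50, 0.48, 0.47, 0.47, 0.47, 0.47],
    "beta1": [0.6, 1.5, 4.0, 11.0, 30.0, 80.0],
}
flags = flag_separation(refit_ses)
```

A data set would then be treated as separated if any of its parameters is flagged.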