Background
Prognosis is one of the central principles of medical practice. Understanding the likely course of a disease or condition is vital if clinicians are to treat patients with confidence or any degree of success. No two patients with the same diagnosis are exactly alike, and the differences between them – e.g. age, sex, disease stage, genetics – may have important effects on the course their disease will take. Such characteristics are called ‘prognostic factors’, a term usually taken to mean factors which influence outcome independently of treatment.
For most applications, a single predictor is not sufficiently precise; rather, a multivariable approach to prognosis is required. Multivariable prognostic research enables the development of tools which give predictions based on multiple important factors; these are variously called prognostic models, prediction models, prediction rules or risk scores [1]. Such research also means that potential new prognostic factors are investigated more thoroughly, as it allows the additional value of the factor, above and beyond that of existing variables, to be established [1].
The majority of prognostic research is done retrospectively, simply because results are obtained much more quickly and cheaply by using existing data. In their 2010 review, Mallett et al. [2] found that 68 % of the 47 included prognostic studies using time-to-event data were retrospective. Altman [3] conducted a review of publications which presented or validated prognostic models for patients with operable breast cancer, and found that of the 61 papers reviewed, 79 % were retrospective studies. Disadvantages of retrospective studies include missing data, a problem which in general cannot be mitigated by researchers. In addition, the assumption that data are missing at random may be implausible in such datasets, biasing results [4]. This is particularly true with stored samples; for example, McGuire et al. [5] report that tumour banks usually contain a disproportionate number of samples from larger tumours, which may introduce bias. Existing datasets may also contain many more candidate variables than are really required to develop a good model, which can lead to multiple testing problems and a temptation to ‘dredge’ the data [6].
The best way to study prognosis is in a prospective study, which ‘enables optimal measurement of predictors and outcome’ [1]. However, a hurdle to designing good quality prognostic studies – whether prospective or retrospective – is ensuring that enough patients are included for the study to achieve the required precision of results. In the second of a series of papers on prognosis research strategies, Riley et al. [7] stress that, in particular, studies aiming to replicate or confirm prognostic factors should ‘incorporate a suitable sample size calculation to ensure adequate power to detect a prognostic effect, if it exists’. Sample size is always an important issue for clinical studies; however, little research has been performed which pertains specifically to the sample size requirements of multivariable prognostic studies. In his review of 61 publications concerning breast cancer models, Altman [3] found that none justified the sample size used, and for many it was impossible to discern the number of patients or events contributing to the final model. Mallett et al. [2] found that although 96 % of studies in their review of survival models reported the number of patients included in analyses, only 70 % reported the number of events – a key quantity for time-to-event data. In the same review, 77 % of the included studies did not give any justification for the sample size used. It is perhaps unsurprising that most papers reporting prognostic research do not justify the sample sizes chosen, as little guidance is available to researchers on how many patients should be included in prognostic studies.
Calculations based on the standard formula for the Cox proportional hazards (PH) model [8] are available for the situation where just one variable is of primary interest but other correlated variables need to be taken into account in the analysis [9–11]. For the more common scenario where researchers wish to produce a multivariable prognostic model and all model variables are potentially equally important, basing sample size on the significance of numerous individual variables is likely to be an intractable problem. In this situation the most often cited sample size recommendation is the rule of ‘10 events per variable’ (EPV), which originated from two simulation studies [12, 13]. In these studies, exponential survival times were simulated for 673 patients from a real randomised trial with 252 deaths and 7 variables (36 EPV), and the number of deaths was then varied to reduce the EPV. The authors found that choosing a single minimum value for EPV was difficult, but that results from studies having fewer than 10 EPV should be ‘cautiously interpreted’ in terms of power, confidence interval coverage and coefficient estimation for the Cox model. A later simulation study found that in ‘a range of circumstances’ having fewer than 10 EPV still provided acceptable confidence interval coverage and bias when using Cox regression, but it did not directly consider the statistical power of analyses nor the variability of the estimates [14]. It is perhaps inevitable that these two papers are often cited to justify low sample sizes. Indeed, Mallett et al. [2] found in their review of papers reporting development of prognostic models in time-to-event data that, of the 28 papers reporting sufficient information to calculate EPV, 14 had fewer than 10 EPV.
In this paper, we take ‘multivariable prognostic model’ to mean a model which is a linear combination of weighted prognostic factors. However, when developing such a model, the individual covariate effects of the prognostic factors may not be of major interest. Instead, the main aim is likely to be measuring the ability of the model to predict outcomes for future patients, or to discriminate between groups of patients. Copas [15] says that ‘…a good predictor may include variables which are “not significant”, exclude others which are, and may involve coefficients which are systematically biased’. Thus basing sample size decisions on the significance of model coefficients alone may not result in the best prognostic model, as well as being complex when the model has multiple terms. Currently there seem to be very few sample size calculations or recommendations for developing or validating multivariable models which are based on the prognostic ability of a model rather than on the significance of its coefficients. During a literature search, few papers were retrieved which consider the issue from this angle. Smith, Harrell and Muhlbaier [16] used simulation to assess the error in survival predictions with increasing numbers of model covariates. Datasets of 250 and 750 subjects (64 and 185 events respectively) were drawn from an exponential distribution such that the average 5-year survival was 75 %. Cox models were fitted to the simulated data, with between 1 and 29 uniformly distributed covariates. The authors found that in both the 64- and 185-event datasets, 5-year survival predictions from the Cox models became increasingly biased upwards as the EPV decreased. In both datasets, the average error was below 10 % when EPV >10, and below 5 % when EPV >20. For ‘sick’ subjects – those at high risk of death – higher EPVs were required: EPV >20 was needed to reduce the expected error to 10 %. This work suggests that an EPV of 20 may be considered a minimum if accuracy of predictions is important; however, as it is found within a National Institutes of Health report, it is not easily available and so seems to be seldom cited. Additionally, two papers considered the effect of sample size on Harrell’s c index. Ambler, Seaman and Omar [17] noted that the value of the c index increased with the number of events; however, this issue was not the main focus of the publication and so investigation of this aspect was limited in scope. Vergouwe et al. [18] considered the number of events required for reliable estimation of the c index in logistic regression models and suggested that a minimum of 100 events and 100 non-events be used for external validation samples, which is likely to be higher than 10 EPV in many datasets. However, being based on binary data, the results are not directly comparable to the sample size issue in prognostic models of time-to-event data.
In this paper we aim to develop calculations based on the prognostic ability of a model in time-to-event data, as quantified by Royston & Sauerbrei’s D measure of prognostic ability. We first describe the D statistic, and then present sample size calculations based on D for use in prognostic studies. Finally we give examples and describe suggested methods for increasing the practical usability of the calculations.
Conclusions
Prognostic studies using time-to-event data are performed often and appear frequently in the medical literature. In general, the aim of such studies is to develop a multivariable model to predict the outcome of interest, with the data typically analysed using the Cox proportional hazards model. Many prognostic studies are performed with retrospective data and often without reference to sample size calculations [2], suggesting that obtaining reliable results from such studies may often be a matter of chance.
The main sample size guidance available to and used by researchers developing prognostic survival models is the events per variable (EPV) calculation with a lower limit of 10 EPV usually quoted; however, this idea is based on just two limited simulation studies. These studies concentrated on the significance of model coefficients, which is of secondary importance in a prognostic model to be used for outcome prediction. In this paper we have presented some sample size calculations based instead on the discrimination ability of a survival model, quantified by Royston and Sauerbrei’s D statistic. We have also given some suggestions and methods for improving the practical use of the calculations in research.
Due to the novel nature of the methods presented in this paper, there are limitations to the work described here and further avenues yet to be explored. In particular, we note that the sample size calculations presented here pay no attention to the number of variables to be explored. From previous work we know that the number of candidate variables for a model can affect the estimate of D in some situations [23]. If a model is developed using an automatic variable selection method and then validated in the same dataset, increasing the number of candidate variables increases the optimism present in the estimate of D; however, we have not covered this issue here. Additionally, we acknowledge that changes in case mix between datasets can add complexity to defining improvement in the prognostic performance of a model, whether D or some other performance measure is used. The methods introduced in [29] may offer a solution to this problem, but it is too early to say; in this paper we have made the assumption that the distributions of covariates are comparable between the datasets used for model development and validation purposes.
We hope that these calculations, and the guidance provided for their use, will help improve the quality of prognostic research. As well as being used to provide sample sizes for prospective studies in time-to-event data, they can also be used for retrospective research; either to give the required sample size before suitable existing data is sought, or to calculate the likely precision of results where a dataset has already been chosen. At the very least we hope that the existence of these calculations will encourage researchers to consider the issue of sample size as a matter of course when developing or validating prognostic multivariable survival models.
Appendix: simulation studies to test sample size calculations
The sample size calculations were tested using simulation, to check that they provided the desired power and α, or the desired confidence interval width.
We simulated time-to-event data from an exponential distribution, with baseline cumulative hazard function \(H_{0}\), using the method described by Bender et al. [31]. The survival time for the proportional hazards (PH) model with regression coefficients (log hazard ratios) β and covariate vector X was simulated using
$$ T_{s}=H_{0}^{-1}\left[-\log (U)\exp (-\beta^{\prime }X)\right] \qquad (7) $$
where U∼U[0,1]. Since simulating a full multivariable covariate vector is complex both computationally and in terms of interpretation, we instead used a surrogate scalar X. X was simulated as N(0,1), and the value of β fixed, so that the resulting prognostic index βX was also normal. In a dataset simulated this way, D=βκ [23].
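To make the inversion step concrete, the following is a minimal sketch (not the authors' code) of Eq. (7) for a constant baseline hazard, an assumption under which \(H_{0}(t)=\lambda_{0}t\) and \(H_{0}^{-1}(v)=v/\lambda_{0}\); the value of lambda0 and the seed are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_survival(n, beta, lambda0=0.1):
    """Simulate T_s = H0^{-1}[-log(U) * exp(-beta * X)] with H0(t) = lambda0 * t."""
    X = rng.standard_normal(n)      # surrogate scalar covariate, X ~ N(0, 1)
    U = rng.uniform(size=n)         # U ~ U[0, 1]
    T = -np.log(U) * np.exp(-beta * X) / lambda0
    return T, X

T, X = simulate_survival(1000, beta=1.0)
```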
We simulated random non-informative right-censoring using the same method to obtain an exponentially distributed censoring time \(T_{c}\) for each patient; note that the \(T_{c}\) were not dependent on X. Records where \(T_{c}<T_{s}\) were considered censored at time \(T_{c}\). The desired censoring proportion was achieved by changing the baseline hazard.
Throughout our simulations we wished to use datasets with an exact number of events and censoring proportion. To obtain a dataset with exactly e1 events and exact censoring proportion cens, we first generated a dataset with \(2(\frac {e_{1}}{1-cens})\) records and approximate censoring proportion cens. We then simply randomly selected e1 records ending in failure, and \(\frac {e_{1}}{1-cens}-e_{1}\) censored records, to form the final dataset.
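A sketch of this over-generate-then-subsample step is given below, under stated assumptions: the censoring hazard lambda_c is a hypothetical value that would need to be tuned to give roughly the target censoring proportion (in the paper this was achieved by changing the baseline hazard), and 0 < cens < 1 is assumed.

```python
import numpy as np

rng = np.random.default_rng(2)

def exact_dataset(e1, cens, beta, lambda0=0.1, lambda_c=0.4):
    """Covariate, observed time and event indicator for a dataset with exactly
    e1 events and censoring proportion cens (assumed 0 < cens < 1)."""
    n = int(np.ceil(2 * e1 / (1 - cens)))             # over-generate records
    X = rng.standard_normal(n)
    Ts = -np.log(rng.uniform(size=n)) * np.exp(-beta * X) / lambda0
    Tc = -np.log(rng.uniform(size=n)) / lambda_c      # censoring times, independent of X
    time = np.minimum(Ts, Tc)
    event = (Tc >= Ts).astype(int)                    # 1 = event observed, 0 = censored

    n_total = int(round(e1 / (1 - cens)))             # final dataset size
    idx_event = rng.choice(np.flatnonzero(event == 1), size=e1, replace=False)
    idx_cens = rng.choice(np.flatnonzero(event == 0), size=n_total - e1, replace=False)
    keep = np.concatenate([idx_event, idx_cens])
    return X[keep], time[keep], event[keep]
```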
The variance or standard error of D was obtained by bootstrap whenever required.
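As an illustration of the bootstrap step, the sketch below resamples records with replacement; estimate_D is a hypothetical stand-in for whatever routine computes Royston & Sauerbrei's D (the appendix does not specify one), and n_boot is an arbitrary illustrative choice.

```python
import numpy as np

def bootstrap_se_D(X, time, event, estimate_D, n_boot=200, seed=3):
    """Bootstrap standard error of D; estimate_D is a user-supplied routine."""
    rng = np.random.default_rng(seed)
    n = len(time)
    reps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # resample records with replacement
        reps.append(estimate_D(X[idx], time[idx], event[idx]))
    return np.std(reps, ddof=1)
```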
Calculations (B1) and (B2)
For (B1) the first step of the simulation is to generate a ‘first’ study with \(e_{1}/(1-cens)\) records and exactly \(e_{1}\) events. This dataset is bootstrapped to obtain \({\sigma _{1}^{2}}\), the variance of D, and then \(\lambda_{s}\) is calculated from this quantity and \(e_{1}\). For (B2) the first step is to calculate \(\lambda_{m}\) from Eq. (3) with the desired estimates of D and cens.
The next steps are common to both (B1) and (B2) once \(e_{2}\) is calculated. Datasets of the required size are generated separately under the null and alternative hypotheses, and bootstrapped to obtain se(D). The whole procedure is repeated 2000 times for each combination of parameters varied (D, power, δ and cens), and test statistics calculated to determine whether the number of events \(e_{2}\) gives the required power and type 1 error. A selection of results is given in Table 2. For (B1), this table shows the results for \(e_{1}=750\); the simulations were repeated for \(e_{1}=1500\) and showed very similar results, but these are not presented here.
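The “(se)” columns in Tables 2 and 3 are consistent with binomial Monte Carlo standard errors over the 2000 repetitions, i.e. \(\sqrt{p(1-p)/2000}\); for example, an empirical type 1 error of 5.5 % gives \(\sqrt{0.055\times 0.945/2000}\approx 0.51\) %. A short sketch of this tallying (our reading, not code from the paper):

```python
import numpy as np

def empirical_rate(rejections):
    """rejections: boolean array with one entry per simulation repetition."""
    p = np.mean(rejections)
    se = np.sqrt(p * (1 - p) / rejections.size)
    return 100 * p, 100 * se                          # reported as percentages

# 110 rejections out of 2000 null-hypothesis repetitions -> 5.5 % (se 0.51)
rate, se = empirical_rate(np.array([True] * 110 + [False] * 1890))
print(f"{rate:.1f} % (se {se:.2f})")
```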
Table 2 Results of simulation study to test (B1) and (B2)

| β | D | power | δ | cens (%) | (B1): e2 | % type 1 (se) | % power (se) | (B2): e2 | % type 1 (se) | % power (se) |
| 1.0 | 1.6 | 80 % | 0.4 | 0 | 222 | 5.5 (0.51) | 81.7 (0.86) | 222 | 5.0 (0.49) | 81.5 (0.87) |
| | | | | 80 | 141 | 5.6 (0.51) | 80.8 (0.88) | 133 | 5.1 (0.49) | 79.5 (0.90) |
| 2.0 | 3.2 | 90 % | 0.5 | 0 | 495 | 4.0 (0.44) | 89.6 (0.68) | 483 | 4.0 (0.44) | 88.1 (0.73) |
| | | | | 80 | 286 | 4.8 (0.48) | 92.1 (0.60) | 291 | 4.4 (0.46) | 92.2 (0.60) |
Calculations (D1) and (D2)
As for (B1), the first step of the simulation study for (D1) is to generate a ‘first study’ to provide values of \(e_{1}\) and \({\sigma _{1}^{2}}\) for the calculation of \(\lambda_{s}\). For (D2), \(\lambda_{m}\) is calculated using Eq. (3). For both (D1) and (D2), once \(e_{2}\) has been calculated, a dataset with the required number of events and censoring proportion is simulated and D calculated. This was repeated 2000 times for each combination of parameters. The proportion of repetitions for which the estimate of D is within w of the input D=βκ gives the % CI which has width ±w. This should approximate 1−α if the sample size calculation and estimation of λ are correct. A selection of results is given in Table 3. For (D1), this table shows the results for \(e_{1}=750\); the simulations were repeated for \(e_{1}=1500\) and showed very similar results, but these are not presented here.
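A corresponding sketch of the coverage check for (D1) and (D2) is shown below, assuming κ = \(\sqrt{8/\pi}\) ≈ 1.60, a value consistent with the inputs D = 1.6 and 3.2 for β = 1.0 and 2.0 in Table 3.

```python
import numpy as np

KAPPA = np.sqrt(8 / np.pi)          # assumed value of kappa, ~1.596

def coverage(D_hats, beta, w):
    """Proportion of repetitions with the estimate of D within +/- w of beta * kappa."""
    within = np.abs(np.asarray(D_hats) - beta * KAPPA) <= w
    p = within.mean()
    se = np.sqrt(p * (1 - p) / within.size)
    return 100 * p, 100 * se        # compare with 100 * (1 - alpha), e.g. 95 %
```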
Table 3 Results of simulation study to test (D1) and (D2)

| β | D | w | cens (%) | (D1): e2 | % of D (se) within βκ±w | (D2): e2 | % of D (se) within βκ±w |
| 1.0 | 1.6 | 0.2 | 0 | 553 | 94.7 (0.50) | 550 | 94.8 (0.50) |
| | | | 80 | 348 | 94.6 (0.51) | 331 | 94.4 (0.51) |
| 2.0 | 3.2 | 0.3 | 0 | 616 | 94.6 (0.51) | 602 | 94.4 (0.51) |
| | | | 80 | 356 | 94.7 (0.50) | 363 | 94.4 (0.52) |
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
All authors developed the methodology. RJ carried out the statistical analysis, simulation studies and literature searching for the D library, and drafted the manuscript. PR and MP provided input into the manuscript. All authors read and approved the final manuscript.