Data source and selection procedure
We identified eligible phase III RCTs reported in four leading general medical journals in 2013: the New England Journal of Medicine (NEJM), the British Medical Journal (BMJ), the Lancet and the Journal of the American Medical Association (JAMA). The trials in our study differed from those in the study by Trinquart et al. [5] in publication date and in the range of disease areas considered: we did not limit eligibility to oncology trials. The selection procedure was as follows.
We included superiority trials in which the primary outcome was subjected to a time-to-event analysis. Our search string identified 586 potential articles. Two authors (JKR and BCO) independently reviewed the abstract, full text and (in some circumstances) the articles’ supporting material. Consensus was reached by discussion. We excluded analyses using pooled data from two or more trials and reports of secondary, subgroup or follow-up analyses. Of the 586 articles examined, 50 satisfied the following inclusion criteria: main study publication, phase III superiority trial, primary outcome was time to event, and the primary test of the null hypothesis was a Cox or logrank test. Five multi-arm trials were included: one 4-arm trial of macular degeneration and four 3-arm trials of HIV, cancer, MRSA and cardiovascular disease.
Data extraction and reconstruction
We extracted information on type of disease, sample size, median follow-up time, primary endpoint and number of events. We ascertained whether a test of non-PH had been carried out, and if so, we recorded the type of test. We determined whether the PH assumption was violated, the nature of the violation and which (if any) methods for handling non-PH were considered. Finally, we noted whether a logrank or Cox test had been performed.
As in the procedure followed by Trinquart et al. [5], we reconstructed individual participant data (IPD) for all patients in each treatment group from published Kaplan–Meier curves. We used the DigitizeIt graphical digitisation package (https://www.digitizeit.de/) to read off the time and survival probability coordinates from the Kaplan–Meier curves. Where possible, we extracted the numbers of patients at risk and the total number of events. We estimated individual times to event or censoring by using the community-contributed Stata program ipdfc [9]. The method is based on an algorithm in R described by Guyot et al. [10].
Kaplan–Meier curves were digitised and IPD were reconstructed by an independent person under the supervision of BCO. We made informal visual checks of reconstructed Kaplan–Meier curves compared with those in the original publication, with satisfactory results. In an informal assessment, we found good agreement between the published estimates of the HR and its 95% confidence interval and those from the data produced by ipdfc.
Combined test of the treatment effect
Under PH, the Cox test has optimal power. The motivation for the combined test is to capitalise on the strength of the Cox test when PH is (nearly) satisfied and to provide insurance (extra power) for cases in which it is not. Under some patterns of non-PH, the power of the Cox test is reduced, even drastically. We aimed to boost the power under such circumstances by combining the Cox test with a suitable additional test—hence the name ‘combined test’.
More generally, the standard null hypothesis in trial design is H0: HR=1 against the alternative H1: HR=δ. Usually δ<1, meaning a reduction (for example) in the mortality rate due to the research treatment. It may happen that H0 is not rejected at some predefined level α but that there are substantial, clinically relevant differences between the two survival curves. In view of the enormous cost and effort involved in mounting a phase III RCT, we would wish to avoid concluding that the treatment effect was non-significant, and therefore that the trial was negative, solely because p>α on a test of the primary outcome that does not cover a wide enough range of relevant alternative hypotheses, that is, of patterns of non-PH.
The challenge due to the limitations of the PH restriction has been recognised in the literature. Although in no way new, one approach, the RMST, has gained ground as a summary measure of a survival function and for comparing two survival curves. See, for example, A’Hern [11] for an argument for its use in oncology trials. RMST is the mean of a time-to-event distribution truncated at a specific time point, sometimes denoted by t∗. RMST has a clear interpretation. For example, with t∗=3 yr, an RMST of 2.5 yr for a group of patients implies that when followed up for 3 yr, on average patients survive 2.5 yr.
The RMST is easily estimated as the area under the corresponding survival curve (e.g. a Kaplan–Meier curve) up to t∗. Unlike the HR, which is model-dependent, the difference in RMST values at t∗ quantifies the treatment effect without requiring any modelling assumption. Once t∗ has been selected, significance testing of the difference in RMST between groups is straightforward.
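As a minimal sketch of the area-under-the-curve calculation described above (not the software used in this study), the following Python code computes a Kaplan–Meier estimate and integrates the resulting step function up to t∗. The function names kaplan_meier and rmst and the toy data are illustrative assumptions.

```python
import numpy as np

def kaplan_meier(time, event):
    """Return the distinct event times and the Kaplan-Meier survival
    estimate immediately after each of them."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    event_times = np.unique(time[event])
    surv = []
    s = 1.0
    for t in event_times:
        at_risk = np.sum(time >= t)            # still under observation at t
        deaths = np.sum((time == t) & event)   # events occurring exactly at t
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def rmst(time, event, t_star):
    """Area under the Kaplan-Meier step function from 0 to t_star.

    Assumes t_star does not exceed the largest observed time; beyond that
    point the curve is not defined if the last observation is censored.
    """
    event_times, surv = kaplan_meier(time, event)
    keep = event_times < t_star
    # Interval boundaries: 0, the event times before t_star, then t_star.
    grid = np.concatenate(([0.0], event_times[keep], [t_star]))
    # Height of the curve on each interval (1.0 before the first event).
    heights = np.concatenate(([1.0], surv[keep]))
    return float(np.sum(np.diff(grid) * heights))

# Toy data (hypothetical): four patients with events at 1, 2, 3 and 4 yr.
example = rmst([1, 2, 3, 4], [1, 1, 1, 1], t_star=3.0)
print(round(example, 4))
```

For the toy data, the curve takes the values 1, 0.75 and 0.5 on the intervals (0,1), (1,2) and (2,3), so the RMST at t∗=3 yr is approximately 2.25 yr. The difference in RMST between two groups is obtained by applying rmst to each group separately with the same t∗.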
A potential weakness with such a use of RMST is the choice of t∗. In a clinical trial paradigm, for a single test of RMST difference to be regarded as valid, t∗ must be prespecified at the design stage. Under PH, a choice of t∗ relatively late during the follow-up may confer power comparable to that of the Cox test [12], but better choices of t∗ may considerably increase the power under various patterns of non-PH. To accommodate this feature, Royston and Parmar [2] suggested testing the RMST difference at several prespecified values of t∗ during the follow-up, taking the smallest p value as the basis of a test. They provided a method based on a permutation test to correct the resulting p value, which is obviously too small. Lastly they took the smaller of the corrected p value and that from the Cox test to give a putative p value, again requiring correction for multiple testing. The final result was p(CT), the p value for the combined test, which has approximately the correct distribution under the global null hypothesis of equal survival curves, H0: S0(t)=S1(t). Details of the approach are given in [2]. An implementation of the combined test in Stata is described by Royston [13]. Power and sample size calculations for the combined test have been implemented for Stata users by Royston [7].
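To illustrate why the smallest of several p values needs correction, and how a permutation test can supply it, the following Python sketch may be helpful. It is not the combined test of [2] or its Stata implementation: as a simplifying assumption it uses a logrank test on the full follow-up, together with logrank tests on data censored at several values of t∗, as stand-ins for the Cox and RMST tests, and it applies a single permutation correction rather than the two-stage correction described above. All function names and the simulated data are hypothetical.

```python
import math
import numpy as np

def logrank_p(time, event, group):
    """Two-sided logrank p value for two groups coded 0/1."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group)
    o_minus_e, var = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()
        n1 = (at_risk & (group == 1)).sum()
        d = ((time == t) & event).sum()
        d1 = ((time == t) & event & (group == 1)).sum()
        o_minus_e += d1 - d * n1 / n           # observed minus expected
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if var == 0:
        return 1.0
    chi2 = o_minus_e ** 2 / var
    return math.erfc(math.sqrt(chi2 / 2))      # upper tail, chi-square 1 df

def min_p(time, event, group, t_stars):
    """Smallest p value over the full-follow-up test and the tests on data
    censored at each t_star (stand-ins for the Cox and RMST tests)."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    ps = [logrank_p(time, event, group)]
    for ts in t_stars:
        ps.append(logrank_p(np.minimum(time, ts), event & (time <= ts), group))
    return min(ps)

def permutation_corrected_p(time, event, group, t_stars, n_perm=200, seed=1):
    """Proportion of label permutations whose minimum p value is at least
    as small as the observed minimum p value."""
    rng = np.random.default_rng(seed)
    group = np.asarray(group)
    observed = min_p(time, event, group, t_stars)
    null = np.array([min_p(time, event, rng.permutation(group), t_stars)
                     for _ in range(n_perm)])
    return float((1 + np.sum(null <= observed)) / (1 + n_perm))

if __name__ == "__main__":
    # Simulated toy trial (assumption): exponential times, censored at 3 yr.
    rng = np.random.default_rng(0)
    group = np.repeat([0, 1], 20)
    raw = rng.exponential(scale=np.where(group == 1, 2.0, 1.0))
    event = raw < 3.0
    time = np.minimum(raw, 3.0)
    t_stars = [1.0, 2.0]
    print("uncorrected min p:", min_p(time, event, group, t_stars))
    print("corrected p:", permutation_corrected_p(time, event, group,
                                                  t_stars, n_perm=100))
```

Permuting the treatment labels preserves the correlation between the component tests, which is the reason a permutation approach is preferred here to a Bonferroni-type adjustment that would treat the tests as independent.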