The determinants of long-term survival are currently of particular interest because of the dramatic increase in the number of patients surviving breast cancer, coupled with the observation that these women are never ‘cured’. Understanding how the impact of prognostic factors and of treatment changes with time since diagnosis is therefore increasingly important. In this context, population-based data are crucial to understanding how treatment influences outcomes for all cancer patients. These aims should be distinguished from those of web-based models, which provide individual patients with an estimate of their survival according to their own values of the prognostic factors.
Our approach
In order to estimate the long-term associations of prognostic factors and treatment with the risk of dying from breast cancer, we used observational data from the population-based Geneva Cancer Registry. For this purpose, we restricted our cohort to a relatively homogeneous group of younger patients (aged under 75 years) with localised disease (stage I and II) who had received surgery. The severity of the disease was controlled for through the combination of several covariables, and the analyses accounted for differences in individual characteristics. Furthermore, the estimation of the mortality related to the disease, after controlling for other causes, was based on flexible excess hazard regression models, which enable the assumptions of linear and time-constant excess hazard ratios to be relaxed. Both of these assumptions are clinically unlikely to hold in the context of long-term survival. We used a recommended strategy [22] for selection of covariables and their complex associations, and performed a sensitivity analysis to evaluate the reproducibility of the model [23].
Despite the use of a fairly homogeneous group of patients, an optimised and up-to-date modelling strategy, a clear process for selecting variables and complex associations, and a sensitivity analysis, our results demonstrated a lack of stability and model misspecification, associated with unrealistic effects of some treatments (e.g. chemotherapy).
Modelling issues
First, our sensitivity analysis demonstrated that the set of covariables included (possibly with NL and/or TD functional forms) in the “final” model for the excess mortality hazard was unstable. Because of this instability, results obtained from a single model should be interpreted with caution. The instability is best illustrated by the fact that 20% of models did not reach convergence during the sensitivity analysis, and by the fact that several variables selected for the single derived model were rarely retained in the sensitivity analysis (low BIF, e.g. TD for hormonal treatment), while others not retained in the derived model were often selected by the sensitivity analysis (high BIF, e.g. hormone receptors).
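As an illustration of how such a sensitivity analysis summarises stability, the sketch below computes an inclusion frequency for each model term from a list of bootstrap model selections. It assumes that BIF denotes the bootstrap inclusion frequency, that is, the percentage of converged bootstrap replicates in which a term was selected. The term names, the toy selections and the choice to exclude non-converged replicates from the denominator are illustrative assumptions, not the data or code used in this study.

```python
from collections import Counter


def bootstrap_inclusion_frequencies(selections):
    """Return (BIF per term in %, share of non-converged replicates in %).

    `selections` holds one entry per bootstrap replicate: the set of terms
    retained by the selection procedure, or None if the model did not converge.
    Assumption: non-converged replicates are excluded from the BIF denominator.
    """
    if not selections:
        raise ValueError("at least one bootstrap replicate is required")
    converged = [s for s in selections if s is not None]
    non_converged_pct = 100.0 * (len(selections) - len(converged)) / len(selections)
    counts = Counter(term for s in converged for term in s)
    bif = {term: 100.0 * n / len(converged) for term, n in counts.items()}
    return bif, non_converged_pct


# Hypothetical replicates; the labels are placeholders, not study results.
toy = [
    {"age:NL", "hormone_receptors:TD"},
    {"hormone_receptors:TD"},
    None,  # a replicate whose model failed to converge
    {"age:NL", "hormone_receptors:TD", "hormonal_treatment:TD"},
    {"hormone_receptors:TD"},
]
bif, failed_pct = bootstrap_inclusion_frequencies(toy)
for term, value in sorted(bif.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {value:.1f}%")
print(f"non-converged replicates: {failed_pct:.1f}%")
```

A term that appears in the single derived model but has a low frequency here, or the reverse, is the kind of disagreement described above.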
There are a number of possible reasons for this lack of robustness. The first is related to the context in which the study was conducted. Since breast cancer patients have high survival, the number of events (“excess” deaths) is relatively low in breast cancer data, even where long-term follow-up is available. This is especially true for the fairly small Geneva population (495,000 inhabitants) and for the study population, which was restricted to early-stage cancer patients. It is recommended that at least 10 events per parameter be available when estimating regression coefficients [25, 26]. Because we considered both time-dependent and non-linear associations for all prognostic variables, the number of parameters included in our model was large relative to the number of deaths. The convergence issues that we encountered are therefore likely to be explained, in part, by a lack of information in the observed data. However, decreasing the number of parameters (either by reducing the number of variables or by excluding some complex associations) would not have been a better strategy, given that our core aim was to better understand the long-term associations of prognostic covariables for breast cancer patients. Neither was it practical to increase the number of women in order to increase the number of events, since this could only have been done by including women with advanced disease (for whom treatment protocols are very different) or by including elderly women (who do not have the opportunity for long-term follow-up, and for whom excess hazard regression modelling would not make sense in the longer term [10]).
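The events-per-parameter rule of thumb cited above amounts to simple arithmetic, sketched below. The counts of excess deaths and of model parameters are hypothetical placeholders chosen only to show the calculation; they are not figures from this study.

```python
def events_per_parameter(n_events: int, n_parameters: int) -> float:
    """Number of observed events available per estimated regression parameter."""
    if n_parameters <= 0:
        raise ValueError("n_parameters must be positive")
    return n_events / n_parameters


def max_parameters(n_events: int, min_epp: float = 10.0) -> int:
    """Largest number of parameters compatible with the rule of thumb."""
    return int(n_events // min_epp)


# Hypothetical example: 300 excess deaths and a model with 45 parameters
# (non-linear and time-dependent terms each add several coefficients).
n_events, n_parameters = 300, 45
epp = events_per_parameter(n_events, n_parameters)
print(f"events per parameter: {epp:.1f}")
print(f"parameters supported at 10 events each: {max_parameters(n_events)}")
```

Under these assumed numbers the model would fall below the recommended threshold, which is the situation described for models that allow both non-linear and time-dependent forms for every covariable.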
The analysis excluded 12.3% of the cohort because of missing data, thus leading to a loss of information. However, this proportion is relatively low for this type of observational data, and complete-case analyses have been shown to be sufficiently efficient for such proportions of missing data [24]. Also, our aim was to highlight the difficulties encountered with modelling in the context of observational data. We therefore performed a complete-case analysis in order not to dilute the message with issues related to multiple imputation.
It is possible that the lack of stability may have been a result of the modelling approach. We consider this unlikely, however. The flexible regression model we applied has been purposefully designed to estimate excess mortality hazard [19] and take into account complex associations. The model selection strategy has previously been shown to be efficient and successful in detecting the correct complex associations as well as eliminating spurious ones [22].
The second main issue was that our strategy was unable to fully control for confounding by indication, leading to model misspecification. This would be an issue even with a perfectly robust model. This confounding is best illustrated by the unexpected results for chemotherapy and hormonal treatment. Women receiving these treatments experienced an increased risk of dying from breast cancer compared to women who did not receive them (Fig. 3). This reflects the fact that the patients in the cohort who received chemotherapy and hormonal treatment were probably those with more advanced disease at diagnosis, among the early-stage cases (Table 1). This represents a limitation of our strategy, which was not able to account for the fact that almost all women who were likely to benefit from these therapies were given them, resulting in a sparse comparison group within the patient cohort (confounding by indication). We performed a stratified analysis to explore this (data not shown). We grouped patients with very similar characteristics together and compared their survival according to receipt of chemotherapy or not. This similarly showed an increased excess hazard of death associated with chemotherapy. This strongly suggests that additional information about the prognosis of patients not receiving chemotherapy is missing from our dataset, and that this led to misspecification of the model.
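The mechanism of confounding by indication can be illustrated with a small worked calculation. The sketch below assumes a hypothetical binary severity marker that is not recorded in the data and that drives both the decision to treat and the risk of death, together with a treatment that truly reduces the risk by 30% in every patient. All probabilities are invented for illustration and are not estimates from the Geneva data; the calculation also uses simple risks rather than excess hazards.

```python
def crude_risk_ratio(p_severe, p_treat, base_risk, true_rr):
    """Treated-versus-untreated risk ratio when severity is ignored.

    p_severe  : proportion of patients with the (unrecorded) severe marker
    p_treat   : probability of treatment, keyed by severity (True/False)
    base_risk : risk of death without treatment, keyed by severity
    true_rr   : true risk ratio of treatment, identical in both strata
    """
    risk = {}
    for treated in (True, False):
        deaths = 0.0
        patients = 0.0
        for severe in (True, False):
            p_s = p_severe if severe else 1.0 - p_severe
            p_t = p_treat[severe] if treated else 1.0 - p_treat[severe]
            r = base_risk[severe] * (true_rr if treated else 1.0)
            deaths += p_s * p_t * r
            patients += p_s * p_t
        risk[treated] = deaths / patients
    return risk[True] / risk[False]


# Hypothetical scenario: severe cases are nearly always treated, so the
# untreated comparison group consists mostly of low-risk patients.
true_rr = 0.7
crude = crude_risk_ratio(
    p_severe=0.5,
    p_treat={True: 0.9, False: 0.1},
    base_risk={True: 0.40, False: 0.05},
    true_rr=true_rr,
)
print(f"true risk ratio within each severity stratum: {true_rr:.2f}")
print(f"crude risk ratio ignoring severity: {crude:.2f}")
```

In this toy setting a protective treatment appears harmful as soon as the severity marker is unavailable, and stratifying on other recorded characteristics would not remove the bias.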
In addition, interactions between treatment received and other covariables might be required. Although we planned to examine such interactions, they were difficult to implement because of the convergence issues we encountered during the modelling process, and it was not reasonable to explore them given our small sample size.
Other possible strategies
Our results point towards the need for statistical strategies, in addition to our modelling strategy, that are better able to examine treatment effects rather than associations alone. Causal inference analyses would be one suitable approach [27–29]. The objective of causal inference is to use observational data and specific statistical techniques to mimic the randomised trial that would have been designed to answer the research question. Propensity score methods could, for example, be implemented within the flexible regression models we have used here [30, 31]. In our work, we assumed that people were treated at the date of diagnosis, which is probably not correct for all patients. Also, changes in the values of prognostic factors for some patients (e.g. growth in tumour size) may suggest that a treatment needs to be undertaken later on after diagnosis. In the presence of such time-varying confounding, other approaches such as the parametric g-formula [32], structural nested models or marginal structural models with inverse probability weighting would also be of interest, especially for the long-term treatment effect [33, 34]. All these approaches assume the models to be well specified, which is not easy to achieve. Various approaches, including the use of machine learning techniques, have been developed to minimise model misspecification [35]. This would, however, require much more detailed data, including comorbidities and other factors used to define the treatment choice. Furthermore, software to implement causal inference techniques is not yet available for the excess mortality hazard. Further methodological research is thus required to enable such analyses to be conducted.
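To make the idea of inverse probability weighting concrete, the following is a minimal sketch for the simplest possible setting: a binary outcome, a treatment fixed at baseline and a single measured binary confounder. It is not an implementation for the excess mortality hazard or for time-varying confounding, and the counts are hypothetical. The propensity score is estimated here as the observed proportion treated within each confounder level, which is an assumption made to keep the example self-contained.

```python
from collections import defaultdict


def crude_risk_ratio(records):
    """Unweighted treated-versus-untreated risk ratio."""
    deaths = {0: 0, 1: 0}
    n = {0: 0, 1: 0}
    for _, treated, died in records:
        deaths[treated] += died
        n[treated] += 1
    return (deaths[1] / n[1]) / (deaths[0] / n[0])


def ipw_risk_ratio(records):
    """Risk ratio after weighting each patient by 1 / P(treatment received | confounder).

    `records` is a list of (confounder_level, treated, died), with treated and died as 0/1.
    """
    n_by_level = defaultdict(int)
    treated_by_level = defaultdict(int)
    for level, treated, _ in records:
        n_by_level[level] += 1
        treated_by_level[level] += treated
    propensity = {lvl: treated_by_level[lvl] / n_by_level[lvl] for lvl in n_by_level}

    weighted_deaths = {0: 0.0, 1: 0.0}
    weighted_n = {0: 0.0, 1: 0.0}
    for level, treated, died in records:
        ps = propensity[level]
        weight = 1.0 / ps if treated else 1.0 / (1.0 - ps)
        weighted_deaths[treated] += weight * died
        weighted_n[treated] += weight
    return (weighted_deaths[1] / weighted_n[1]) / (weighted_deaths[0] / weighted_n[0])


def expand(level, treated, n, deaths):
    """Build individual records from a hypothetical cell count."""
    return [(level, treated, 1)] * deaths + [(level, treated, 0)] * (n - deaths)


# Hypothetical cohort: the high-risk level (1) is mostly treated, the low-risk
# level (0) mostly untreated; the risk ratio is 0.5 within each level.
toy = (expand(1, 1, 80, 16) + expand(1, 0, 20, 8)
       + expand(0, 1, 20, 1) + expand(0, 0, 80, 8))
print(f"crude risk ratio: {crude_risk_ratio(toy):.3f}")
print(f"IPW risk ratio:   {ipw_risk_ratio(toy):.3f}")
```

The weighting only works when every confounder level contains both treated and untreated patients. With a sparse comparison group, as in the cohort analysed here, the weights for the few untreated high-risk patients become very large and the estimate unstable.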
Clinical interpretations
Nevertheless, a few cautious clinical interpretations can be drawn from these data. Some covariables presented high BIFs within the sensitivity analysis, and the observed associations appeared stable to the exclusion of outliers, suggesting that they are indicative of robust underlying associations. Consistent with Jatoi et al. [14], we found that patients with negative hormone receptors presented a higher excess mortality during the first years after diagnosis compared to those with positive hormone receptors (BIF 95.4%). Regarding age at diagnosis, our results matched those of Cluze et al. [16], who showed that the risk of dying from breast cancer was associated with increasing age at 1 and 5 years after diagnosis but that this association reversed at 10 years (BIF 87.1%). In addition to hormone receptor status and age at diagnosis, tumour size, grade and nodal involvement displayed associations similar to those described in a previous meta-analysis [12]. Although our results are broadly consistent with previous studies, caution should be exercised in reporting the size of these associations, given that they have been derived from models that display a lack of robustness. We observed a time-dependent association for radiotherapy: patients treated with radiotherapy exhibited a decreased excess mortality hazard in the first 10 years following their diagnosis but an increased hazard afterwards. This association was, however, sensitive to the inclusion or exclusion of outliers. That said, it could potentially correspond to late side effects of treatment, in particular cardiac complications, which are known to be a likely consequence of irradiation delivered close to the heart [36–38].