A combined neural network and decision trees model for prognosis of breast cancer relapse

https://doi.org/10.1016/S0933-3657(02)00086-6

Abstract

The prediction of clinical outcome of patients after breast cancer surgery plays an important role in medical tasks such as diagnosis and treatment planning. Different prognostic factors for breast cancer outcome appear to be significant predictors for overall survival, but probably form part of a bigger picture comprising many factors. Survival estimations are currently performed by clinicians using the statistical techniques of survival analysis. In this context, artificial neural networks have been shown to be a powerful tool for analysing datasets where there are complicated non-linear interactions between the input data and the information to be predicted. This paper presents a decision support tool for the prognosis of breast cancer relapse that combines a novel TDIDT algorithm (control of induction by sample division method, CIDIM), which selects the most relevant prognostic factors for the accurate prognosis of breast cancer, with a system composed of different neural network topologies that takes the selected variables as input in order to reach a good correct classification probability. In addition, a new method for estimating Bayes’ optimal error using the neural network paradigm is proposed. Clinical–pathological data were obtained from the Medical Oncology Service of the Hospital Clínico Universitario of Málaga, Spain. The results show that the proposed system is a useful tool for clinicians to search through large datasets seeking subtle patterns in prognostic factors, and that it may further assist the selection of appropriate adjuvant treatments for the individual patient.

Introduction

Prediction tasks are among the most interesting activities in which to implement intelligent systems. Specifically, prediction is an attempt to accurately forecast the outcome of a specific situation, using as input information obtained from a concrete set of variables that potentially describe the situation.

A problem often faced in clinical medicine is how to reach a conclusion about the prognosis of cancer patients when presented with complex clinical and prognostic information, since specialists usually make decisions based on a simple dichotomization of variables into a favourable and unfavourable classification [18]. As we enter the new millennium, treatment modalities exist for many solid tumour types and their use is well established. Nevertheless, offset against this is the toxicity of some treatments. As there is a real risk of mortality associated with treatment, it is vital to be able to offer different therapies depending on the patient. In this respect, the likelihood that the patient will suffer a recurrence of her disease is very important, so that the risks and expected benefits of specific therapies can be compared.

This work analyses, on the one hand, the decision-making process existing when patients with primary breast cancer should receive a certain therapy to remove the primary tumour. On the other hand, different prognostic factors appear to be significant predictors for overall survival, but probably form part of a bigger picture comprising many, inter-related factors [11]. In order to investigate this hypothesis, studies looking at a large number of potential prognostic factors are needed. To further complicate matters, these relationships may well be non-linear in nature. These form the major difficulties in such studies. Furthermore, the statistical analysis of large datasets using standard methodologies is cumbersome and limited, especially in the case of non-linear relationships.

Among prognostic modelling techniques that induce models from medical data, survival analysis methods are specific both in terms of modelling and the type of data required. Survival models attempt to determine the probability of the event occurring within a specific time, which requires classification models that classify either the occurrence or non-occurrence of the event and optionally model the outcome probabilities. Several tools successfully used in the construction of medical prognosis models have been proposed by the machine learning community [17], [34].

Neural networks are a form of artificial intelligence that has found application in a wide range of problems [10], [20], [24] and has given, in many cases, superior results to standard statistical models [33]. Baxt [4] demonstrated the predictive reliability of an artificial neural network model in medical diagnosis. In this work, we utilise the ability of neural networks to recognise complex and highly non-linear relationships, such as are likely to characterise medical circumstances.

Some authors [14], [30] have modelled systems for outcome prediction in post-surgery breast and lung carcinoma patients using neural networks to perform survival analysis. This type of modelling manages the problem of censored data handling, which arises when the event related to the censor variable normally included in the survival data (such as death or recurrence of a disease) has not occurred during the follow-up period for a patient, although the event may eventually occur. These authors have solved the problem by using different survival estimators to handle censored data for patients. This would imply that prognostic factors (for example, in breast cancer with adjuvant therapy after surgery) are not time-dependent, but this is not really true: the strength of a prognostic factor is not the same for different time intervals. Different techniques for survival estimation, such as Kaplan–Meier analysis [15] and Cox regression modelling [6], assume that the strength of a prognostic factor does not change over time. In addition, the existence of a “peak” of recurrence in the distribution of relapse probability [2] demonstrates that the recurrence probability is not the same over time. Consequently, if these statistical techniques are not appropriate for this problem, a possible solution would be to incorporate the whole set of prognostic factors pre-selected by medical experts (Section 3.1) as input to the neural networks system. This would involve removing all the patients with censored data; however, the set of remaining patient data vectors would then be too small to constitute a significant representation of this problem.
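To make the notion of censored follow-up concrete, the following is a minimal sketch of the standard Kaplan–Meier product-limit estimator in plain Python. It is a textbook formulation, not code from this study; the function name `kaplan_meier` and the toy follow-up data are hypothetical and chosen only for illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time where a relapse was observed.

    times  -- follow-up time per patient (e.g. in months)
    events -- 1 if relapse was observed at that time, 0 if the patient was censored
    """
    relapses = Counter(t for t, e in zip(times, events) if e == 1)
    leaving = Counter(times)  # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = relapses.get(t, 0)
        if d > 0:
            # Product-limit step: multiply by the fraction of at-risk
            # patients who did not relapse at time t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Censored patients count as at risk at time t, then drop out.
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data, not taken from the study.
times = [6, 12, 12, 18, 24, 30]
events = [1, 1, 0, 1, 0, 0]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(f"month {t}: relapse-free fraction {s:.3f}")
```

Censored patients contribute to the risk set up to their last follow-up and then leave it without counting as relapses. This is how the estimator uses their partial information instead of discarding them.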

This work proposes a new system approach based on: (1) specific topologies of neural networks for different time intervals during the follow-up time of the patients, considering the events occurring in different intervals as different problems; and (2) decision trees, useful in understanding the underlying relationships in breast cancer data, for selecting the most important prognostic factors corresponding to every time interval. This is not the first attempt to combine decision trees and neural networks [1], [7], but it does present different ways of integrating them.
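One way to picture point (1) is as a set of separate binary classification datasets, one per follow-up interval. The sketch below is an assumption about how such datasets might be built, not the procedure used in the paper: the labelling rule, the function name `build_interval_dataset`, the interval boundaries and the `size` feature are all hypothetical.

```python
def build_interval_dataset(patients, lo, hi):
    """Build a labelled dataset for the follow-up interval (lo, hi], in months.

    patients -- iterable of (features, time, relapsed), where `time` is the
                month of relapse if `relapsed` is True, otherwise the month
                of last follow-up (censoring).

    Assumed labelling rule (illustrative only):
      1 -> relapse observed inside (lo, hi]
      0 -> patient known to be relapse-free at month hi
      patients whose follow-up ended at or before lo, or who were censored
      inside the interval, are left out because their outcome is unknown.
    """
    rows = []
    for features, time, relapsed in patients:
        if time <= lo:
            continue  # relapsed or censored before this interval begins
        if relapsed and time <= hi:
            rows.append((features, 1))
        elif time > hi:
            rows.append((features, 0))
        # Remaining case: censored inside the interval, so skipped.
    return rows


# Hypothetical toy patients: (features, time in months, relapsed?)
patients = [
    ({"size": 2.1}, 8, True),
    ({"size": 1.0}, 40, False),
    ({"size": 3.5}, 20, True),
    ({"size": 1.7}, 10, False),
    ({"size": 2.8}, 30, True),
]
intervals = [(0, 12), (12, 24), (24, 36)]
datasets = {iv: build_interval_dataset(patients, *iv) for iv in intervals}
labels = {iv: [y for _, y in rows] for iv, rows in datasets.items()}
```

Under this rule a censored patient still contributes to every interval that ends before the last follow-up. Each interval's dataset could then be given its own attribute selection step and its own network.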

In addition, we introduce a new decision tree algorithm, control of induction by sample division method (CIDIM), which reduces the number of rules and improves the selection of the database attributes that become significant prognostic factors. Furthermore, a new upper-bound estimate of the problem-difficulty level, based on Bayes’ correct classification probability, is also proposed.
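For intuition about why the Bayes correct classification probability acts as an upper bound, the sketch below computes the textbook plug-in quantity for discrete inputs: the sum over input values x of the largest joint frequency of (x, class), divided by the sample size. It is not the neural-network-based estimator proposed in this work; the function name and the toy counts are hypothetical.

```python
from collections import Counter, defaultdict


def bayes_correct_probability(samples):
    """Plug-in estimate of the Bayes correct classification probability.

    samples -- iterable of (x, label) pairs with discrete, hashable x.
    For each x, the best possible rule predicts the majority label, so the
    achievable accuracy is the sum of the majority counts over all x,
    divided by the number of samples.
    """
    counts = defaultdict(Counter)
    n = 0
    for x, label in samples:
        counts[x][label] += 1
        n += 1
    return sum(max(c.values()) for c in counts.values()) / n


# Hypothetical toy data: one binary attribute, label 1 = relapse.
samples = (
    [("low", 0)] * 40 + [("low", 1)] * 10
    + [("high", 0)] * 15 + [("high", 1)] * 35
)
p_correct = bayes_correct_probability(samples)
print(f"Bayes correct classification probability: {p_correct:.2f}")
print(f"Bayes error: {1 - p_correct:.2f}")
```

No classifier that sees only this attribute can exceed this accuracy on the same distribution, which is the sense in which the quantity bounds problem difficulty. Computed on a finite sample, the plug-in value tends to be optimistic.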

Breast cancer overview

Breast cancer is a malignant tumour that has developed from cells of the breast. Although scientists know some of the risk factors (i.e. ageing, genetic risk factors, family history, menstrual periods, not having children, obesity) that increase a woman’s chance of developing breast cancer, they do not yet know what causes most breast cancers or exactly how some of these risk factors cause cells to become cancerous. Research is under way to learn more and scientists are making great progress in

Patient data

Data from 1035 patients with breast cancer disease from the Medical Oncology Service of the Hospital Clínico Universitario of Málaga, Spain were collected and recorded during the period 1990–2000. Data corresponding to every patient were structured in 85 fields containing information about post-surgical measurements, personal data, and type of treatment. Part of this information regarding patients is not relevant for predicting outcome, so that only 14 independent input variables, pre-selected

Results and discussion

Table 6 shows the number of patients and the selection of prognostic factors corresponding to every time interval (in months) of patients’ follow-up that were selected for training the neural networks system. After processing the patient database through the decision trees system (CIDIM algorithm), certain attributes appear to be the most significant prognostic factors (second column in Table 6) becoming the input to the artificial neural networks system. The decision trees system makes the

Conclusions

This paper presents a decision-support tool for the prognosis of breast cancer relapse using clinical–pathological data. We propose a model that combines a novel TDIDT algorithm (CIDIM) with a system composed of different neural network topologies to approximate Bayes’ optimal error for the prediction of patient relapse after breast cancer surgery. The CIDIM algorithm selects the most relevant prognostic factors for the accurate prognosis of breast cancer, while the neural networks system

Acknowledgements

We would like to thank the referees for their valuable comments and suggestions, and also the Oncology Service staff of the Hospital Clínico Universitario of Málaga for their comments and collaboration in this work. This work has been partially supported by the FRESCO project, number PB98-0937-C04-01, of CICYT Spain.

References (34)

  • H.A. Abbass et al. C-Net: a method for generating non-deterministic and dynamic multivariate decision trees. Know. Inform. Syst. (2001)
  • Alba E et al. Estructura del patron de recurrencia en el cancer de mama operable (CMO) tras el tratamiento primario....
  • S. Amari et al. Statistical theory of overtraining - is cross-validation asymptotically effective? Adv. Neural Inform. Process. Syst. (1996)
  • W. Buntine et al. A further comparison of splitting rules for decision-tree induction. Mach. Learn. (1992)
  • D.R. Cox. Regression models and life tables. J. R. Stat. Soc. (1972)
  • F. D’alche-Buc et al. Trio learning: a new strategy for building hybrid neural trees. Neural Syst. (1994)
  • Duda RO, Hart PE. Pattern classification and scene analysis. New York: Wiley;...