A combined neural network and decision trees model for prognosis of breast cancer relapse
Introduction
Prediction tasks are among the most interesting activities in which to implement intelligent systems. Specifically, prediction is an attempt to accurately forecast the outcome of a specific situation, using as input information obtained from a concrete set of variables that potentially describe the situation.
A problem often faced in clinical medicine is how to reach a conclusion about the prognosis of cancer patients when presented with complex clinical and prognostic information, since specialists usually make decisions based on a simple dichotomization of variables into a favourable and unfavourable classification [18]. As we enter the new millennium, treatment modalities exist for many solid tumour types and their use is well established. Offset against this, however, is the toxicity of some treatments. As there is a real risk of mortality associated with treatment, it is vital to be able to offer different therapies depending on the patient. In this sense, the likelihood that the patient will suffer a recurrence of her disease is very important, so that the risks and expected benefits of specific therapies can be compared.
This work analyses, on the one hand, the decision-making process that determines whether patients with primary breast cancer should receive a certain therapy to remove the primary tumour. On the other hand, different prognostic factors appear to be significant predictors for overall survival, but probably form part of a bigger picture comprising many inter-related factors [11]. In order to investigate this hypothesis, studies looking at a large number of potential prognostic factors are needed. To further complicate matters, these relationships may well be non-linear in nature. These are the major difficulties in such studies. Furthermore, the statistical analysis of large datasets using standard methodologies is cumbersome and limited, especially in the case of non-linear relationships.
Among prognostic modelling techniques that induce models from medical data, survival analysis methods are specific both in terms of modelling and the type of data required. Survival models attempt to determine the probability of the event occurring within a specific time, which requires classification models that classify either the occurrence or non-occurrence of the event and optionally model the outcome probabilities. Several tools successfully used in the construction of medical prognosis models have been proposed by the machine learning community [17], [34].
Neural networks are a form of artificial intelligence that has found application in a wide range of problems [10], [20], [24] and has given, in many cases, superior results to standard statistical models [33]. Baxt [4] demonstrated the predictive reliability of an artificial neural network model in medical diagnosis. In this case, we utilise the ability of neural networks to recognise complex and highly non-linear relationships, such as are likely to characterise medical circumstances.
Some authors [14], [30] have modelled systems for outcome prediction in post-surgery breast and lung carcinoma patients using neural networks to perform survival analysis. This type of modelling must handle censored data, which arise when the event tied to the censoring variable normally included in survival data (such as death or recurrence of a disease) has not occurred during a patient’s follow-up period, although it may eventually occur. These authors have solved the problem by using different survival estimators to handle censored data for patients. This would imply that prognostic factors (for example, in breast cancer with adjuvant therapy after surgery) are not time-dependent, but this is not really true: the strength of a prognostic factor is not the same for different time intervals. Techniques for survival estimation such as Kaplan–Meier analysis [15] and Cox regression modelling [6] assume that the strength of a prognostic factor does not change over time. In addition, the existence of a “peak” of recurrence in the distribution of relapse probability [2] demonstrates that the recurrence probability is not the same over time. Hence, if these statistical techniques are not appropriate for this problem, a possible solution would be to incorporate the whole set of prognostic factors pre-selected by medical experts (Section 3.1) as input to the neural networks system. This would involve removing all the patients with censored data; however, the resulting set of patient data vectors would then become too small to constitute a significant representation of this problem.
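To make the role of censoring concrete, the following is a minimal sketch of the standard Kaplan–Meier product-limit estimator mentioned above. It is not code from the cited works; the function name `kaplan_meier` and the toy follow-up data are hypothetical and serve only as an illustration.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Product-limit estimate of the relapse-free probability.

    times  -- follow-up time of each patient (e.g. in months)
    events -- 1 if relapse was observed at that time, 0 if the patient
              was censored (left follow-up without an observed relapse)
    Returns a list of (time, survival) pairs, one per distinct relapse time.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    relapses = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)   # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = relapses.get(t, 0)
        if d:
            # Censored patients still count in the denominator up to their exit.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve


# Hypothetical toy data: five patients, two of them censored.
toy_times = [6, 12, 12, 20, 30]
toy_events = [1, 1, 0, 1, 0]
for month, s in kaplan_meier(toy_times, toy_events):
    print(month, round(s, 3))
```

Censored patients contribute to the risk set until they leave follow-up, which is how the estimator uses partial information instead of discarding those patients.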
This work proposes a new system approach based on: (1) specific topologies of neural networks for different time intervals during the follow-up time of the patients, considering the events occurring in different intervals as different problems; and (2) decision trees, useful in understanding the underlying relationships in breast cancer data, for selecting the most important prognostic factors corresponding to every time interval. This is not the first attempt to combine decision trees and neural networks [1], [7], but it does present different ways of integrating them.
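As an illustration of point (1), the sketch below shows one plausible way to turn follow-up records into a separate classification problem per time interval. The labelling rule, the record fields (`features`, `months`, `relapsed`) and the interval boundaries are assumptions made for this example, not details taken from the paper.

```python
def interval_dataset(patients, start, end):
    """Build (features, label) rows for the follow-up interval [start, end).

    Each patient is a dict with hypothetical keys:
      'features' -- tuple of prognostic factor values
      'months'   -- time to relapse if relapsed, else time of last follow-up
      'relapsed' -- True if a relapse was observed
    Assumed labelling rule:
      1 -- relapse observed inside the interval
      0 -- patient known to be relapse-free through the end of the interval
    Patients whose relapse or last follow-up precedes the interval, or who
    are censored inside it, are left out because their outcome is unknown.
    """
    rows = []
    for p in patients:
        if p["months"] < start:
            continue                      # no longer under observation
        if p["relapsed"] and p["months"] < end:
            rows.append((p["features"], 1))
        elif p["months"] >= end:
            rows.append((p["features"], 0))
        # otherwise: censored inside the interval, so skipped
    return rows


# Hypothetical records and interval boundaries (in months).
patients = [
    {"features": (1, 0), "months": 10, "relapsed": True},
    {"features": (0, 1), "months": 30, "relapsed": False},
    {"features": (0, 0), "months": 15, "relapsed": False},
    {"features": (1, 1), "months": 40, "relapsed": True},
]
for start, end in [(0, 12), (12, 24), (24, 48)]:
    print((start, end), interval_dataset(patients, start, end))
```

Under this rule a censored patient is dropped only from the intervals in which the outcome is unknown, so earlier intervals keep more training examples than a single model trained on uncensored patients alone would have.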
In addition, we introduce a new decision trees algorithm, control of induction by sample division method (CIDIM), for reducing the number of rules and improving the selection of attributes from the database to become significant prognostic factors. Furthermore, a new upper-bound estimate of the problem-difficulty level, based on the correct classification Bayes’ probability, is also proposed.
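The idea of bounding problem difficulty by the Bayes probability of correct classification can be sketched for discrete attributes as follows. This is an assumed, simplified estimator and not necessarily the one proposed here: for every distinct attribute vector in the sample, no classifier that depends only on those attributes can do better than predicting the majority class. The function name and the toy rows are hypothetical.

```python
from collections import Counter, defaultdict


def bayes_accuracy_upper_bound(rows):
    """Empirical upper bound on accuracy over a labelled sample.

    rows -- iterable of (features, label) pairs with discrete features.
    Patients sharing identical features but different outcomes cannot all
    be classified correctly, so the best achievable count per feature
    vector is the size of its majority class.
    """
    by_features = defaultdict(Counter)
    for features, label in rows:
        by_features[tuple(features)][label] += 1
    total = sum(sum(counts.values()) for counts in by_features.values())
    if total == 0:
        raise ValueError("at least one labelled row is required")
    best = sum(max(counts.values()) for counts in by_features.values())
    return best / total


# Hypothetical sample: (features, relapse label).
toy_rows = [
    ((1, 0), 1), ((1, 0), 1), ((1, 0), 0),
    ((0, 1), 0), ((0, 1), 0),
    ((0, 0), 1),
]
print(bayes_accuracy_upper_bound(toy_rows))
```

Because the bound is computed on the same sample it describes, it is optimistic; it is useful mainly as a ceiling against which the accuracy of a trained network can be compared.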
Breast cancer overview
Breast cancer is a malignant tumour that has developed from cells of the breast. Although scientists know some of the risk factors (e.g. ageing, genetic risk factors, family history, menstrual periods, not having children, obesity) that increase a woman’s chance of developing breast cancer, they do not yet know what causes most breast cancers or exactly how some of these risk factors cause cells to become cancerous. Research is under way to learn more and scientists are making great progress in
Patient data
Data from 1035 patients with breast cancer disease from the Medical Oncology Service of the Hospital Clínico Universitario of Málaga, Spain were collected and recorded during the period 1990–2000. Data corresponding to every patient were structured in 85 fields containing information about post-surgical measurements, personal data, and type of treatment. Part of this information regarding patients is not relevant for predicting outcome, so that only 14 independent input variables, pre-selected
Results and discussion
Table 6 shows the number of patients and the selection of prognostic factors corresponding to every time interval (in months) of patients’ follow-up that were selected for training the neural networks system. After processing the patient database through the decision trees system (CIDIM algorithm), certain attributes appear to be the most significant prognostic factors (second column in Table 6) becoming the input to the artificial neural networks system. The decision trees system makes the
Conclusions
This paper presents a decision-support tool for the prognosis of breast cancer relapse using clinical–pathological data. We propose a model that combines a novel TDIDT (top-down induction of decision trees) algorithm, CIDIM, with a system composed of different neural network topologies to approximate Bayes’ optimal error for the prediction of patient relapse after breast cancer surgery. The CIDIM algorithm selects the most relevant prognostic factors for the accurate prognosis of breast cancer, while the neural networks system
Acknowledgements
We would like to thank the referees for their valuable comments and suggestions, and also the Oncology Service staff of the Hospital Clínico Universitario of Málaga for their comments and collaboration in this work. This work has been partially supported by the FRESCO project, number PB98-0937-C04-01, of CICYT Spain.
References (34)
Application of neural networks to clinical medicine. Lancet (1995)
Multilayer neural networks and Bayes decision theory. Neural Networks (1998)
et al., Analysis of hidden units in a layered network trained to classify sonar targets. Neural Networks (1988)
et al., Artificial neural networks: a new model for assessing prognostic factors. Ann. Oncol. (2000)
et al., Experiments to determine whether recursive partitioning (CART) or an artificial neural network overcomes theoretical limitations of Cox proportional hazards regression. Comput. Biomed. Res. (1998)
et al., Prognostic methods in medicine. Artif. Intell. Med. (1999)
et al., Comparison of different neural networks algorithms in the diagnosis of acute appendicitis. Int. J. Biomed. Comput. (1996)
et al., Treatment of missing data values in a neural network based decision support system for acute abdominal pain. Artif. Intell. Med. (1998)
et al., Predicting the secondary structure of globular proteins using neural network models. J. Mol. Biol. (1988)
et al., Machine learning for survival analysis: a case study on recurrence of prostate cancer. Artif. Intell. Med. (2000)
C-Net: a method for generating non-deterministic and dynamic multivariate decision trees. Know. Inform. Syst.
Statistical theory of overtraining—is cross-validation asymptotically effective? Adv. Neural Inform. Process. Syst.
A further comparison of splitting rules for decision-tree induction. Mach. Learn.
Regression models and life tables. J. R. Stat. Soc.
Trio learning: a new strategy for building hybrid neural trees. Neural Syst.