The Lancet

Volume 361, Issue 9369, 10 May 2003, Pages 1590-1596

Articles
Gene expression predictors of breast cancer outcomes

https://doi.org/10.1016/S0140-6736(03)13308-9

Summary

Background

Correlation of risk factors with genomic data promises to provide specific treatment for individual patients, and needs interpretation of complex, multivariate patterns in gene expression data, as well as assessment of their ability to improve clinical predictions. We aimed to predict nodal metastatic states and relapse for breast cancer patients.

Methods

We analysed DNA microarray data from samples of primary breast tumours, using non-linear statistical analyses to assess multiple patterns of interactions of groups of genes that have predictive value for the individual patient, with respect to lymph node metastasis and cancer recurrence.

Findings

We identified aggregate patterns of gene expression (metagenes) that associate with lymph node status and recurrence, and that are capable of predicting outcomes in individual patients with about 90% accuracy. The metagenes defined distinct groups of genes, suggesting different biological processes underlying these two characteristics of breast cancer. Initial external validation came from similarly accurate predictions of nodal status of a small sample in a distinct population.

Interpretation

Multiple aggregate measures of profiles of gene expression define valuable predictive associations with lymph node metastasis and disease recurrence for individual patients. Gene expression data have the potential to aid accurate, individualised prognosis. Importantly, these data are assessed in terms of precise numerical predictions, with ranges of probabilities of outcome. Precise and statistically valid assessments of patient-specific risks will ultimately be of most value to clinicians faced with treatment decisions.

Introduction

Calibration of therapeutic intervention for an individual's outlook is central to effective oncological treatment. In breast cancer, invasion into axillary lymph nodes is the most important prognostic factor.1, 2 Dissection of axillary nodes is therefore crucial in therapeutic decision making. New, less invasive methods for assessment of lymph node status, such as sentinel node biopsy, are gaining acceptance,1 but clinico-pathological indices, such as the presence or absence of positive axillary nodes, remain the best way to classify patients into broad subgroups by recurrence and survival.3, 4, 5 In patients with no detectable lymph-node involvement, a population thought to be at low risk, between 22% and 33% develop recurrent disease after a 10-year follow-up.6 It is not currently possible to identify the individuals in this group who are at risk of recurrence.

Diagnosis of lymph-node status is important in accurate prediction of disease course and recurrence of breast cancer. Although clinical predictors are useful, they are not accurate enough for prediction in the individual patient. Genomic measures of gene expression provide new information to identify patterns of gene activity that subclassify tumours.7, 8, 9, 10 Such patterns might correlate with biological and clinical properties of the tumours, so we could usefully investigate whether, and how, such data might add predictive value to clinical predictors. Credible assessment of predictors is critical to establish reproducible results, and a key step towards integration of complex genomic data into outlook for individual patients.11, 12, 13, 14

Here, we move towards this goal by looking at gene expression patterns that predict involvement of the lymph node and recurrence of breast cancer in defined patient subgroups. We focus on predictions for the individual patient and aim to provide quantitative measures—in respect of probabilities of clinical phenotype and disease outcome—that summarise the genomic information relevant to such prediction.

Section snippets

Procedures

The analyses detailed here comply with the MIAME (minimum information about a microarray experiment) guidelines established by the Microarray Gene Expression Data Society (www.mged.org). The analysis used 89 tumour samples for comparative measurements of gene expression. Our goal was to identify gene expression patterns that are characteristic of particular sets of tumour samples within the group. These samples represent a heterogeneous population, and were selected on the basis of clinical

Statistical analysis

Our analysis used predictive statistical tree models (unpublished). The analysis begins with an initial screen to remove genes that show little variation, and then applies k-means correlation-based clustering, targeting numerous clusters that generate a corresponding number of metagene patterns. Every metagene is the dominant single factor (principal component) within a cluster, as assessed by the singular value decomposition. We identified 496 such factors this way, each representing the key common

Results

In our previous study we compared low-risk versus high-risk patients, mainly based on lymph node status, for assessment of the predictive associations of gene expression patterns with aggressive versus benign tumours. In oestrogen receptor (ER) positive individuals the high-risk clinical profile is represented by advanced lymph node metastases (ten or more positive nodes); the low-risk profile identifies node negative women older than 40 years of age, with tumour size less than 2 cm. These

Discussion

Our assessment of complex, multivariate patterns in gene-expression data from primary tumour biopsy specimens, and examination of the value of such patterns in prediction of lymph-node metastasis and relapse resulted in a predictive accuracy of about 90%. The analysis provided additional understanding of individual outcomes and confirmed the use of gene expression patterns as prognostic factors in breast cancer. The group analysis of lymph-node risk defines metagene patterns that can accurately

GLOSSARY

k-means clustering
Standard clustering of genes into a number (k) of separate groups. Correlation-based k-means defines groups to include genes most highly related in terms of correlation calculated across samples.
singular value decomposition
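A mathematical procedure that factorises a data matrix into orthogonal components ordered by the amount of variation each explains, so that the dominant trends in large datasets can be identified.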
gene-specific noise
Variation in measurements of gene expression that is not due to underlying causes across a set of genes, and that mainly relates to experimental and processing errors.
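
The two glossary procedures above can be combined into a small illustration of how a metagene might be computed. The following Python sketch is not the authors' implementation (their tree models are described as unpublished). It assumes NumPy, a deterministic farthest-first choice of starting centres, row centring before the singular value decomposition, and synthetic data. The function names `standardise_rows`, `correlation_kmeans` and `metagene` are hypothetical.

```python
import numpy as np


def standardise_rows(x):
    """Centre each row and scale it to unit length, so that dot
    products between rows equal Pearson correlations."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.where(norms == 0.0, 1.0, norms)


def correlation_kmeans(expr, k, n_iter=100):
    """Cluster genes (rows) by their correlation across samples (columns)."""
    z = standardise_rows(expr)
    # Assumed initialisation (not from the source): start from gene 0, then
    # repeatedly add the gene least correlated with the centres chosen so far.
    chosen = [0]
    while len(chosen) < k:
        best_corr = (z @ z[chosen].T).max(axis=1)
        best_corr[chosen] = np.inf  # never pick the same gene twice
        chosen.append(int(np.argmin(best_corr)))
    centres = z[chosen].copy()
    labels = np.full(len(z), -1)
    for _ in range(n_iter):
        # Assign each gene to the centre it correlates with most strongly.
        new_labels = np.argmax(z @ centres.T, axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each centre to the standardised mean profile of its members.
        for j in range(k):
            members = z[labels == j]
            if len(members) > 0:
                mean_profile = members.mean(axis=0, keepdims=True)
                centres[j] = standardise_rows(mean_profile)[0]
    return labels


def metagene(cluster_expr):
    """Dominant sample-wise pattern (first principal component) of a cluster."""
    x = np.asarray(cluster_expr, dtype=float)
    centred = x - x.mean(axis=1, keepdims=True)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    pattern = vt[0]
    # Singular vectors have arbitrary sign; orient towards the mean profile.
    if pattern @ centred.mean(axis=0) < 0:
        pattern = -pattern
    return pattern


# Synthetic example: 10 genes over 8 samples, built from two latent patterns.
rng = np.random.default_rng(0)
trend = np.arange(8.0)
alternating = np.array([1.0, -1.0] * 4)
scales = (1.0, 2.0, 0.5, 1.5, 3.0)
expr = np.vstack(
    [a * trend + 5.0 for a in scales]
    + [a * alternating + 2.0 for a in scales]
) + rng.normal(scale=0.05, size=(10, 8))

labels = correlation_kmeans(expr, k=2)
metagenes = [metagene(expr[labels == j]) for j in range(2)]
```

Here each metagene is a unit-length vector with one value per sample. In a workflow like the one summarised above, such vectors would serve as candidate predictors. The initial screen for low-variation genes and the predictive tree models themselves are outside the scope of this sketch.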
