Main

The Gleason grading of prostate cancer has been established for over 40 years (Gleason, 1966). Although the basic grading categories have remained unchanged in this time, there have been numerous changes in the methodologies used to determine the Gleason score (GS) of prostate cancer over that period.

Changes were first introduced in the 1970s (Gleason and Mellinger, 1974). The advent of immunochemistry for basal cell markers in the 1990s introduced further upward shifts in Gleason grading, as it was realised that many low-grade lesions diagnosed as prostate cancer were benign lesions such as atypical adenomatous hyperplasia (Bostwick and Chang, 1999; Berney et al, 2007). A number of authors cautioned against the diagnosis of very low GSs (Epstein, 2000; Berney, 2007), and this was codified at the 2005 consensus meeting of the International Society of Urological Pathology (ISUP) (Epstein et al, 2005), where it was recommended that scores <6 ‘should rarely if ever’ be made. More recently, at the 2014 ISUP Chicago conference it was agreed that GSs 2–4 ‘should not be made’ on biopsy (Epstein et al, 2016). Although no statements were made concerning GS 5 (3+2 or 2+3), this score is currently also rarely assigned on biopsy.

There have been further debates since then on unresolved issues in Gleason grading. Numerous studies have shown that GS 3+3=6 tumours show little propensity to recur or metastasise when completely resected by radical prostatectomy (Miyamoto et al, 2009; Ross et al, 2012). However, as biopsy specimens remain samples of the tumour, there is a degree of uncertainty as to whether unsampled higher-grade tumour is present whenever GS 3+3=6 is diagnosed.

GS ranges from 2 to 10, but the fact that 6 is the lowest practicable score is very confusing for clinicians and patients (Berney, 2007).

A further concern is that GS 3+4=7 and GS 4+3=7 are not separated in most current prognostic tools, although many studies have shown the differences in these scores to be prognostically significant.

In an era when active surveillance is increasingly offered to patients with low-risk prostate cancer, a revision to prostate cancer grading has been proposed (Pierorazio et al, 2013) based on five grade groups. This has been accepted by a meeting of senior uropathologists, oncologists and surgeons at an ISUP conference in Chicago in 2014 (Epstein et al, 2016). The correlation of GS and grade groups is shown in Table 1.

Table 1 A detailed comparison of contemporary Gleason scoring and grade groups
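
As a minimal sketch of the correspondence summarised in Table 1, the Python function below maps a primary and secondary Gleason pattern to a grade group. Grade groups 1 to 3 follow the definitions used in this paper; grade groups 4 (GS 8) and 5 (GS 9–10) follow the published ISUP 2014 definition (Epstein et al, 2016). The function name and the restriction to patterns 3–5 are illustrative assumptions and are not part of the study methods.

```python
def grade_group(primary: int, secondary: int) -> int:
    """Map primary and secondary Gleason patterns to grade group 1-5.

    Assumes contemporary biopsy grading, where only patterns 3-5 are assigned.
    """
    if primary not in (3, 4, 5) or secondary not in (3, 4, 5):
        raise ValueError("patterns are expected to be 3, 4 or 5")
    score = primary + secondary
    if score == 6:
        return 1                         # GS 3+3=6
    if score == 7:
        return 2 if primary == 3 else 3  # GS 3+4=7 vs GS 4+3=7
    if score == 8:
        return 4                         # GS 4+4, 3+5 or 5+3
    return 5                             # GS 9-10


for p, s in [(3, 3), (3, 4), (4, 3), (4, 4), (4, 5)]:
    print(f"GS {p}+{s}={p + s} -> grade group {grade_group(p, s)}")
```

Splitting GS 7 on the primary pattern is what separates 3+4=7 from 4+3=7, a distinction that a summed score alone cannot express.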

This grading system has been validated using biochemical relapse as an outcome in a large international series of radical prostatectomy patients (Epstein et al, 2015). However, it has not been validated in a conservatively treated cohort with prostate cancer death as the end point.

There are other crucial refinements in the interpretation of prostate cancer grading, which need to be clarified for use by clinicians and pathologists. There have been some changes to the pattern assignments. It has been agreed that cribriform glands and glomeruloid glands should all be given Gleason pattern 4, in line with a number of separate lines of evidence on cribriform (Martinez-Rodriguez et al, 2007; Dong et al, 2013; Kir et al, 2014; Kweldam et al, 2014; van der Kwast, 2014) and glomeruloid patterns (Pacelli et al, 1998; Gobbi et al, 1999; Lotan and Epstein, 2009; Liu, Chang et al, 2011). There has been debate on whether the ‘worst’ score seen in a single core of a biopsy series is more or less predictive of outcome than an ‘overall’ score judged by the pathologist after reviewing the whole series (Kunz and Epstein, 2003; Kunju et al, 2009; Tolonen et al, 2011). Both ‘worst’ and ‘overall’ scores are used throughout Europe in pathology practice (Berney et al, 2013), although typically the highest score is used by clinicians (Rubin et al, 2004).

In this study, we examine the proposed changes in the grading of prostate cancer in a biopsy series treated conservatively and re-reviewed to these new standards. We investigate whether this new grading system can be applied to this data set and whether ‘overall’ or ‘worst’ score best predicts prostate cancer death.

Materials and methods

Patients

Cases of prostate cancer were identified from three cancer registries in Great Britain. Within each region, collaborating hospitals were sought and cases from these hospitals were reviewed. Men were included in this study if they were under 76 years of age at the date of diagnosis and had clinically localised prostate cancer diagnosed by needle biopsy between 1990 and 2003 inclusive. The median date of diagnosis was May 2002. Patients treated by radical prostatectomy or radiation therapy within 6 months of diagnosis were excluded. In addition, those with objective evidence of metastatic disease (by bone scan, X-ray, radiograph, CT scan, MRI, bone biopsy, lymph node biopsy or pelvic lymph node dissection) or clinical indications of metastatic disease (including pathologic fracture, soft-tissue metastases, spinal compression or bone pain), or a PSA measurement over 100 ng ml−1 at or within 6 months of diagnosis, were also excluded. Men who had hormone therapy before the diagnostic biopsy were also excluded because of the influence of hormone treatment on Gleason pattern. We also excluded men who died within 6 months of diagnosis or had <6 months of follow-up.

Original histological specimens from the diagnostic procedure were requested and centrally reviewed by a panel of three expert urological pathologists to confirm the diagnosis of adenocarcinoma and to reassign GSs using a contemporary and consistent interpretation of the Gleason scoring system (Epstein, 2010). The panel met and discussed all controversial cases and a selection of others to audit the data set. Cribriform and glomeruloid glands were all assigned Gleason pattern 4. Every core in each case was given a separate score, and an overall score for the case was also given, based on the opinion of the pathologist and using a methodology agreed in consensus before analysis. The method chosen was to assign the overall grade thought to be the best estimate of what would be seen at radical prostatectomy. For instance, in a biopsy series with numerous cores with Gleason 4+3=7 and a small amount of Gleason 4+4=8 or even higher in a single core, the pathologist might judge that Gleason 4+3=7 was a more representative score. It was also taken into account that tiny amounts of pattern 5 carcinoma are not included in the grading of radical prostatectomy specimens but are given a tertiary score. Percentages of each pattern seen were given. Follow-up was conducted through the cancer registries and the cut-off date was 31 December 2012. Deaths were divided into those from prostate cancer and those from other causes, according to World Health Organisation standardised criteria (WHO, 2010). National ethics approval was obtained from the Northern Multicentre Research Ethics Committee, followed by local ethics committee approval at each of the collaborating hospitals.
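
The difference between the ‘worst’ score and the pathologist's ‘overall’ score can be illustrated with a short Python sketch. Only the ‘worst’ score is mechanical: it is the highest-graded core in the series. The ranking rule used here (total score first, then primary pattern) and all names are illustrative assumptions; the ‘overall’ score is a matter of judgement and is not computed.

```python
from typing import List, Tuple

Core = Tuple[int, int]  # (primary pattern, secondary pattern) for one positive core


def worst_score(cores: List[Core]) -> Core:
    """Return the highest-grade core, ranked by total score and then by primary pattern."""
    if not cores:
        raise ValueError("at least one graded core is required")
    return max(cores, key=lambda c: (c[0] + c[1], c[0]))


# Example from the text: numerous cores with 4+3=7 and a single core with 4+4=8.
series = [(4, 3)] * 5 + [(4, 4)]
primary, secondary = worst_score(series)
print(f"worst score: {primary}+{secondary}={primary + secondary}")
```

In this series the ‘worst’ score is 4+4=8, whereas the reviewing pathologist might judge 4+3=7 to be the more representative ‘overall’ score.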

Statistical analysis

Survival was analysed with a Cox proportional hazards model. The primary end point was death from prostate cancer. Observations were censored on the date of last follow-up, or at death from other causes. All events were used for estimating hazard ratios (maximum follow-up 232 months), but follow-up was censored at 10 years for predicting 10-year risks. Covariates evaluated were: centrally reviewed overall and worst GS, baseline PSA value, clinical stage, extent of disease (proportion of positive cores), age at diagnosis and use of hormone treatment. Analysis was repeated substituting ‘worst’ GS for ‘overall’ GS and analysed according to the five grade groups.
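
The censoring rules above can be sketched in Python as follows. The status labels, the function name and the choice to keep an event occurring exactly at the horizon are illustrative assumptions; the sketch only shows how each patient's follow-up could be turned into a (time, event) pair before model fitting, not the fitting itself.

```python
from typing import Optional, Tuple

PCA_DEATH = "prostate_cancer_death"
OTHER_DEATH = "death_from_other_causes"
ALIVE = "alive_at_last_follow_up"


def survival_record(months: float, status: str,
                    horizon: Optional[float] = None) -> Tuple[float, bool]:
    """Return (follow-up time in months, event flag) for one patient.

    Only death from prostate cancer counts as an event; death from other
    causes and end of follow-up are censored. If a horizon is given,
    follow-up beyond it is truncated and treated as censored.
    """
    if status not in (PCA_DEATH, OTHER_DEATH, ALIVE):
        raise ValueError(f"unknown status: {status}")
    event = status == PCA_DEATH
    if horizon is not None and months > horizon:
        return horizon, False
    return months, event


# All events are kept for hazard ratios; a 120-month horizon is used for 10-year risk.
print(survival_record(150, PCA_DEATH))
print(survival_record(150, PCA_DEATH, horizon=120))
```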

Baseline PSA concentration was defined as the last pre-diagnostic PSA measurement within 6 months before diagnosis. If no such PSA value was available, we took the first post-diagnostic PSA within 6 months; failing that, the pre-diagnostic PSA taken closest to the date of diagnosis was used. All PSA values after treatment with hormones or orchiectomy or within 3 weeks after a surgical procedure to the prostate were excluded.

PSA concentration was modelled as the natural logarithm of (1+PSA (ng ml−1)). Patients with values >100 ng ml−1 were excluded as likely to be metastatic disease. GSs were evaluated in five prognostic grade categories by ‘worst’ GS and ‘overall’ GS.
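
As a brief illustration of the PSA transformation, the following Python sketch computes ln(1+PSA). Adding 1 keeps the transform defined, and equal to zero, for a PSA of 0 ng ml−1. The function name is a hypothetical choice for illustration.

```python
import math


def log_psa(psa_ng_ml: float) -> float:
    """Return the natural logarithm of (1 + PSA), with PSA in ng/ml."""
    if psa_ng_ml < 0:
        raise ValueError("PSA concentration cannot be negative")
    return math.log1p(psa_ng_ml)


for psa in (0.0, 4.0, 10.0, 100.0):
    print(f"PSA {psa:6.1f} ng/ml -> log PSA {log_psa(psa):.3f}")
```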

The primary assessment was a univariate analysis of the association between grade group by overall GS and death from prostate cancer and repeated for ‘worst’ GS. Statistical analyses were done with STATA (version 12, StataCorp, College Station, TX, USA) and R (version 3.0, The R Foundation for Statistical Computing, Vienna, Austria). Multivariate analysis included clinical T stage, diagnostic serum PSA and the volume of disease (percentage of involved cores), and method of treatment (initial hormone treatment or no initial hormonal treatment).

Results

A total of 6501 cores from 988 individual cases were assessed for malignancy and graded. The mean, median and interquartile range of patient age, number of cores sampled, serum PSA and percentage of cores involved are shown in Table 2. Cases were divided into the five prognostic grade groups from the GS, and a comparison between the prognostic grade groups using both ‘worst’ and ‘overall’ GS is shown in Figure 1.

Table 2 Distribution of the mean, median and interquartile range of patient age, serum PSA, number of cores sampled and percentage of cores involved by tumour across the grade groups
Figure 1

Comparison of overall and worst Grade Group frequencies.

Both ‘overall’ and ‘worst’ GS analysis yielded highly significant results. The log-rank test for overall GS in five grade groups gave P=2.79 × 10^−26 (χ2=126, df=4). For the worst GS, this was P=1.43 × 10^−24 (χ2=118, df=4), with overall GS therefore slightly, but not significantly, outperforming worst GS. It should be noted that GS 3+4=7 (grade group 2) separated highly significantly from GS 4+3=7 (grade group 3). Cox model analysis with hazard ratios by both overall and worst grade group also showed high levels of significance (Table 3 and Figure 2). Out of 988 patients, 574 received early hormonal therapy, whereas 414 received watchful waiting only as initial treatment. When analysed separately using overall assessments of grade group, P=2.85 × 10^−12 (χ2=60, df=4) for the early hormone-treated group, whereas P=1.05 × 10^−4 (χ2=23.4, df=4) for the non-hormone-treated group.
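
The log-rank P-values quoted for the overall and worst GS can be checked from the χ2 statistics and degrees of freedom. The sketch below is not the software used in the study; it relies on the closed-form upper tail of the χ2 distribution for an even number of degrees of freedom, which for df=4 reduces to exp(−x/2)(1+x/2). The function name is an illustrative assumption.

```python
import math


def chi2_sf_even_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable with an even df.

    Uses P(X > x) = exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    if x < 0:
        raise ValueError("the chi-square statistic cannot be negative")
    half = x / 2.0
    term = 1.0   # (x/2)^i / i! for i = 0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


for label, stat in (("overall GS", 126.0), ("worst GS", 118.0)):
    print(f"{label}: chi2={stat:g}, df=4, P={chi2_sf_even_df(stat, 4):.2e}")
```

Because the χ2 statistics are reported to three significant figures or fewer, agreement with the quoted P-values can only be approximate.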

Table 3 Cox Model analysis with hazard ratios by overall and worst grade groups–estimates compared with reference Grade group 1 (GS 3+3=6)
Figure 2

Kaplan–Meier plots of the 5 Grade Groups by worst Grade Group (A) and overall Grade Group (B).

On multivariate analysis including log PSA, extent of disease (percentage of involved cores), T stage (stages 3 and 4 merged) and the method of initial treatment, grade group remained significant, with a χ2 (4 df) of 10.3 for overall grade and 9.2 for worst grade (Table 4). A complete data set was available for 755 patients, as some patients were missing details of clinical stage. For the multivariate Cox models, the Harrell c-statistic was 0.756 (se=0.028) for overall grade and 0.752 (se=0.028) for worst grade.
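
For readers unfamiliar with the Harrell c-statistic, the following Python sketch shows how it can be computed from follow-up times, event flags and model risk scores. It is a simplified illustration, not the STATA or R routine used in the analysis: pairs with tied follow-up times are skipped, tied risk scores count as one half, and all names are assumptions.

```python
from typing import Sequence


def harrell_c(times: Sequence[float], events: Sequence[bool],
              risks: Sequence[float]) -> float:
    """Proportion of comparable patient pairs ranked correctly by risk score.

    A pair is comparable when the patient with the shorter follow-up had an
    observed event. It is concordant when that patient also has the higher risk.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a censored patient cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: higher risk scores belong to patients who died earlier.
print(harrell_c([5, 10, 15], [True, True, False], [3.0, 2.0, 1.0]))
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect ranking.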

Table 4 Multivariate analysis of overall and worst grade group including stage, PSA, initial treatment method and extent of disease. (% disease expressed as % of involved biopsy cores)

Removal of extent of disease (which was of low significance) from the multivariate model increased the significance of log PSA, with a higher hazard ratio (1.36) and a more significant P-value (0.010), and of tumour stage 3/4 vs 1, with a higher hazard ratio (2.30) and a more significant P-value (0.010). Similar changes were seen in the worst grade multivariate model (log PSA hazard ratio=1.37, P=0.008; tumour stage hazard ratio=2.46, P=0.010).

Discussion

These results show, for the first time, that in a conservatively treated cohort with prostate cancer death as an outcome, interpretation of GS using modern criteria can effectively separate five prognostic grade groups. The power of grade groups to predict outcome in this cohort is considerable. This shows that modern interpretation of GS is valid not only when using pathological surrogates for outcome or biochemical recurrence, but also correlates with prostate cancer death. We also suggest that grade groups, as proposed in other papers, can be confidently used in reports alongside GS. This will aid both clinicians and patients in their understanding of the severity of the cancer, and aid treatment decisions and counselling for active surveillance patients. Gleason scoring presents a ‘skewed’ scale to patients, with a scale running from 2 to 10, when the lowest valid score is 6. Explaining to patients that a GS 6 cancer is low risk can be difficult. Translation of this to ‘grade group 1’ will be easier for patients to understand, and for clinicians to explain (Berney, 2007).

It should be especially noted that there is a significant split between GS 3+4=7 and GS 4+3=7 (grade groups 2 and 3), which has not been well translated in previous risk assessments such as CAPRA (May et al, 2007; Lughezzani et al, 2010).

The least significant separation is between GS 4+3=7 and GS 4+4=8, and this requires further investigation. Certainly, minor elements of pattern 3 cancer seem to matter little in overall prognosis.

The use of an ‘overall’ or ‘worst’ score has been considerably debated in the literature (Kunju et al, 2009; Tolonen et al, 2011). There is great variability in how GS is assigned in different centres. Some have advocated assigning a GS to every core and giving no ‘overall’ score for the case. Other pathologists give a GS per submitted specimen pot, which might include more than one core (Berney et al, 2013). There have been no direct comparisons of the different methods in a series of conservatively treated prostate carcinomas with long-term outcome. There is a concern that a ‘worst’ GS might overstate the severity of the disease, especially when the volume of high-grade disease in a single core is small and there is widespread disease of a lower grade in other cores.

We have shown here that the ‘worst’ GS has a very similar prognostic ability to an ‘overall’ GS. As it is easier to calculate and relies less on the subjectivity of individual consultant pathologists, we advocate its use in routine practice. Using the ‘worst’ GS, there appears to be greater separation of grade groups 3 and 4. The ‘worst’ GS was also used in both the initial and validating studies of grade groups, which showed significant differences between grade groups 3 and 4 (Pierorazio et al, 2013; Epstein et al, 2015). In addition, the ‘rules’ for assigning an overall GS are not clear and are prone to variation between pathologists.

The strengths of this study include the large sample size and detailed nature of the centralised pathological review. In many series it is unclear whether individual cores have been separately graded, especially when they are processed within one cassette or slide.

The weaknesses of the study include its retrospective nature, and the criticism that prostate cancer is no longer treated in the same manner as it was 20 years ago. The majority of the cohort is from sextant biopsies, which is not contemporary practice. This is an unavoidable weakness of retrospective studies that must allow sufficient follow-up to examine prostate cancer death as an outcome. It affects all current long-term studies of prostate cancer outcome, and the same criticism can be levelled at large trials such as PROTECT (Oxley et al, 2015), where the methods of biopsy are no longer standard of care. This will be an on-going problem in prostate cancer outcome studies, with the continuing advance of imaging and template biopsy techniques. For the foreseeable future, pathological grading of prostate cancer will remain standard of care, and adjuvant techniques such as imaging or molecular pathology, which are complementary, are unlikely to take over from the current gold standard.

In conclusion, we have validated five grade groups in a biopsy series of prostate cancer using prostate cancer death as an outcome. This study complements other studies that used PSA relapse as an outcome, and supports the use of this system internationally.