Data
Cases were selected for the TCGA project based on patient consent and the availability of adequate tissue for the intensive planned mutational analyses. Thus the selection of cases cannot be considered representative of all diagnosed cases and may result in a preponderance of features characteristic of more advanced cases with larger tumors or may under-represent metastatic cases that frequently do not undergo nephrectomy.
We elected to focus on four distinct genomic platforms: mRNA, copy number, methylation and mutation. mRNA expression results were generated from the Illumina HiSeq platform. We used normalized log counts and filtered out genes with low expression (median < 5 counts) and low variability (MAD < 1.25), following standard practice of TCGA investigations, leaving 1267 genes from the original panel of 20531. Methylation data were generated from the Illumina 27k and 450k panels as described previously [8]. A total of 25014 probes were examined, with the sex chromosomes excluded. Data were standardized across samples within each platform and then merged, and the top 1000 most variable probes were selected for analysis. Copy number data were derived from Affymetrix SNP 6.0 arrays. We used a reduction parameter (ϵ = 0.001) to obtain a total of 2312 regions, and our data comprise the segment means from each region. These filtering approaches are based on the premise that about 1000 probes are sufficient to capture any relevant structure in the data, while the addition of more probes, especially those with low signal or low variance, is likely to add noise. Mutation data were obtained from the supplementary files of the original publication of the TCGA without any additional processing [10].
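The filtering steps described above can be sketched as follows. This is only an illustration under stated assumptions: the function names (`mad`, `filter_genes`, `top_variable`) are hypothetical, the MAD is taken unscaled, a gene is kept only if it passes both thresholds, and both thresholds are applied to the same vector of values, none of which is specified in the text.

```python
from statistics import median, pvariance

def mad(values):
    """Unscaled median absolute deviation of a list of numbers."""
    center = median(values)
    return median([abs(v - center) for v in values])

def filter_genes(expr, min_median=5.0, min_mad=1.25):
    """Keep genes that pass both thresholds (assumed reading of the filter).

    expr maps a gene name to its list of per-sample values.
    """
    return {gene: vals for gene, vals in expr.items()
            if median(vals) >= min_median and mad(vals) >= min_mad}

def top_variable(features, n_keep=1000):
    """Return the names of the n_keep features with the largest variance."""
    ranked = sorted(features, key=lambda name: pvariance(features[name]),
                    reverse=True)
    return ranked[:n_keep]
```

For example, `filter_genes(expr)` would be applied to the expression values and `top_variable(probes, 1000)` to the standardized methylation probes.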
Risk factor data were obtained from the medical records. We obtained information on smoking status at diagnosis (current, former, never smoker), body mass index (BMI) categorized in accordance with World Health Organization criteria (<25, 25-29, 30+ kg/m²) and lifetime history of hypertension (yes, no), all of which are established risk factors for kidney cancer [11-14]. In addition we include age and gender, since cancer incidence in general is influenced by both of these factors. Instructions for how to reconstruct the data are provided in Additional file 1, Supplementary Materials (Data Archive).
Analytic framework
Details of our general analytic strategy were explained in a previous article [7]. In the following we summarize the essential conceptual features of the approach, and some modifications we have made to suit the nature of the TCGA data, namely the extensiveness of the genomic profiling and the fact that the study includes only cases with cancer and no healthy controls. Our primary goal is to identify tumor sub-types that are etiologically distinct. To accomplish this we use a hybrid clustering strategy that employs classical k-means clustering using the genomic profiles of the tumors to identify candidate solutions. K-means clustering endeavors to find the set of clusters that maximizes the weighted Euclidean distance between the clusters using the inter-cluster dissimilarity, denoted by G, as the distance measure. Because of the complexity of identifying the maximum of a scalar function in multi-dimensional space, k-means clustering from an initial random seed generally reaches a local maximum rather than the global maximum. Thus the method involves repeated maximization using different random seeds, with the maximum of the various local maxima chosen as the ultimate solution. In our approach, rather than choosing the solution with the highest value of G, for each local maximum we calculate a measure of etiologic heterogeneity and choose the solution with the highest value of this measure. We used 10,000 k-means runs for this purpose. Empirically, the individual values of the clustering measure identified (defined below) were each observed sufficiently frequently that we are confident we did not fail to identify the maximum. Each clustering analysis involves initial specification of the number of clusters. That is, we perform an analysis based on the assumption that there exist 2 clusters, then we perform an analysis based on 3 clusters, and so forth.
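The hybrid strategy can be illustrated with a minimal sketch. It assumes a plain Lloyd's-algorithm k-means written out in the standard library (not necessarily the implementation used in the study) and a user-supplied `score` callable standing in for the etiologic heterogeneity measure. All function names are hypothetical, and the default number of runs is kept small for illustration rather than the 10,000 used in the analysis.

```python
import random

def kmeans(points, k, rng, max_iter=100):
    """One run of Lloyd's algorithm from a random start; returns cluster labels."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        if new_labels == labels:  # assignments stable: a local optimum
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the previous center if a cluster empties
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def canonical(labels):
    """Relabel clusters by order of first appearance so equal partitions match."""
    seen = {}
    return tuple(seen.setdefault(l, len(seen)) for l in labels)

def best_by_heterogeneity(points, k, score, n_runs=200, seed=0):
    """Collect distinct local optima over random restarts, then return the
    partition with the highest heterogeneity score (not the highest G)."""
    rng = random.Random(seed)
    candidates = {canonical(kmeans(points, k, rng)) for _ in range(n_runs)}
    return max(candidates, key=score)
```

Here `points` is a list of genomic profiles (one numeric tuple per tumor) and `score(labels)` returns the heterogeneity measure for a candidate partition. Repeating the call for k = 2, 3, and so forth mirrors the stepwise specification of the number of clusters.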
Our measure of etiologic heterogeneity is based on two related concepts. The first is that in studying risk factors we desire to maximize the predictability of disease occurrence in individuals, and that a useful measure of predictability is the extent of variation of the risks of individuals in the population. That is, the more widely varying the individual risks, the more easily we are able to predict the disease. We use for this purpose the coefficient of variation of disease risks, denoted by $K$, a measure that aggregates the relative contributions of individual risk factors. In any disease sub-type the corresponding coefficient of variation of the risks of the sub-type is denoted $K_j$ for sub-type $j$. That is, if $r_i$ is the overall disease risk for the $i$th individual and $r_{ji}$ is the corresponding risk of sub-type $j$, then
$$K = v^{1/2}/\mu \quad \text{and} \quad K_j = v_j^{1/2}/\mu_j,$$
where
$$\mu = n^{-1}\sum_i r_i, \quad v = n^{-1}\sum_i r_i^2 - \mu^2, \quad \mu_j = n^{-1}\sum_i r_{ji}, \quad v_j = n^{-1}\sum_i r_{ji}^2 - \mu_j^2,$$
and where $n$ is the number of subjects in the population at risk. The etiologic heterogeneity of sub-types can be characterized by the correlations of the risks of the individual sub-types, with low (or negative) correlation representing high degrees of heterogeneity. Thus the coefficients of covariation, $K_{jk} = c_{jk}/(\mu_j \mu_k)$, where $c_{jk} = n^{-1}\sum_i r_{ji} r_{ki} - \mu_j \mu_k$, reflect (inversely) the degrees of etiologic heterogeneity between pairs of sub-types. The second concept is that increasing etiologic heterogeneity between sub-types inevitably increases the collective risk predictability within sub-types. Thus by using a measure of incremental risk prediction denoted by
$$D = \left(\sum_{j=1}^{m} \pi_j K_j^2\right) - K^2, \qquad (1)$$
where $\pi_1, \pi_2, \ldots, \pi_m$ represent the proportions of cases in each of $m$ sub-types, we are able to choose sets of sub-types that maximize the extent to which the average risk predictability of the set of sub-types (the term in parentheses) exceeds the risk predictability of the disease as a unitary entity (as represented by $K^2$), and by so doing we also maximize the collective etiologic heterogeneity of the sub-types. This can be seen by observing that D can also be written in the following way, showing that it increases with decreasing values of the covariances:
$$D = \sum_{j<k} \pi_j \pi_k \left(K_j^2 + K_k^2 - 2K_{jk}\right), \qquad (2)$$
where the summation extends to all pairs of sub-types.
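As a numerical check of the two expressions for D, the sketch below computes both from a small table of sub-type risks. It takes D in form (1) as the π-weighted sum of the squared sub-type coefficients of variation minus K², and in form (2) as the sum over pairs of π_j π_k (K_j² + K_k² − 2K_jk). It further assumes that the sub-types are mutually exclusive and exhaustive, so that r_i is the sum of the r_ji, and that π_j = μ_j/μ. The function name and the input layout are hypothetical.

```python
def heterogeneity_D(risks):
    """Return D computed via form (1) and via the pairwise form (2).

    risks[j][i] is the risk of sub-type j for subject i in the
    population at risk (assumed layout).
    """
    def mean(xs):
        return sum(xs) / len(xs)

    m, n = len(risks), len(risks[0])
    # Overall risk r_i, assumed to be the sum of the sub-type risks.
    total = [sum(risks[j][i] for j in range(m)) for i in range(n)]
    mu = mean(total)
    K2 = (mean([r * r for r in total]) - mu ** 2) / mu ** 2
    mu_j = [mean(r) for r in risks]
    K2_j = [(mean([x * x for x in r]) - mj ** 2) / mj ** 2
            for r, mj in zip(risks, mu_j)]
    pi = [mj / mu for mj in mu_j]  # assumed: pi_j = mu_j / mu

    d_form1 = sum(p * k2 for p, k2 in zip(pi, K2_j)) - K2

    d_form2 = 0.0
    for j in range(m):
        for k in range(j + 1, m):
            c_jk = (mean([a * b for a, b in zip(risks[j], risks[k])])
                    - mu_j[j] * mu_j[k])
            K_jk = c_jk / (mu_j[j] * mu_j[k])
            d_form2 += pi[j] * pi[k] * (K2_j[j] + K2_j[k] - 2 * K_jk)
    return d_form1, d_form2
```

With two sub-types whose risks move in opposite directions across subjects, both forms return the same positive value. When the sub-type risks are proportional to one another, both return zero up to rounding, which corresponds to the absence of etiologic heterogeneity.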
To calculate the various coefficients of variation and covariation one needs to obtain risk predictors for each sub-type for each case. In the context of a case-control study these can be obtained from polytomous logistic regression of the sub-types on the risk factors, as described in our previous work [7]. However, the kidney TCGA dataset contains only cases, with no disease-free controls. The case-only design permits estimation of the ratios of the relative risks of the different sub-types for any subject but does not permit estimation of the relative risk of disease itself [15]. However, we can calculate an approximation to D, denoted D*, that captures the essential features of the heterogeneity signal as follows.
The preceding formulas (1) and (2) represent averages with respect to the population at risk. Since the controls in a case-control study represent the population at risk, the variance and covariance components of the formulas must be estimated by averaging over the controls. In a case-only study we can only calculate such terms using cases, and so the corresponding summation terms represent averages over the population distribution of cases. Cases occur based on risk-biased sampling from the population at risk, and so the various terms we use in calculating our measure of etiologic heterogeneity are averaged with respect to this risk-biased sample. Risk-biased sampling means that individuals become cases in direct proportion to the individual’s risk. Consequently, to deconvolute the distribution of risks obtained from a sample of cases in order to equate it with the corresponding distribution from controls, one would have to reweight each case in inverse proportion to its risk, i.e. the $i$th case must be reweighted by the factor $\mu/r_i$, which involves relative risks that are not estimable in the setting of a case-only study. In the absence of controls we must simply estimate the variance and covariance terms that comprise D using the cases. To see the impact of this we make use of the fact that D can be re-expressed in terms of individual, case-specific deviations of the sub-type probabilities from their overall relative frequencies as follows:
(3)
where $u_{ji} = r_{ji}/r_i$ represents the conditional probability that the $i$th case belongs to the $j$th sub-type. The last term in parentheses represents the deviation of the sub-type probabilities for the $i$th case for the $j$th and $k$th sub-types. Greater etiologic heterogeneity is reflected by larger values of these deviations. If we simply use cases to estimate the variances and covariances that comprise D in (1) then we are in effect estimating D*, where
(4)
That is, the contributions of individual cases are additionally weighted in proportion to their risks via the terms $\{r_i/\mu\}$. The effect of this change will be to give greater weight in (4) to risk strata with higher risks and correspondingly lesser weight to risk strata with lower risk. We cannot compare these terms empirically since we have no controls, but it is clear that the impact of the difference will be minimal unless there is both a very broad range of individual risks, and a trend for the “outliers” to occur preferentially at one end of the risk scale. Moreover, the goal of our analysis is not to evaluate the absolute magnitude of D. It is to use relative values of D to rank different sub-typing options to determine which ones exhibit the greatest degrees of etiologic heterogeneity. Intuitively the rankings of D and D* are likely to be very similar in practice, even in the presence of broad variation in the underlying risks.
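The risk-biased sampling argument can be checked with a small deterministic example. The sketch assumes that subject i becomes a case with probability proportional to r_i, so an average over cases equals a population average weighted by r_i/μ, and reweighting each case by μ/r_i recovers the population average. The function names are hypothetical, and the risks are supplied directly even though they are not estimable from case-only data.

```python
def population_mean(f):
    """Average of f over the population at risk."""
    return sum(f) / len(f)

def case_mean(f, risks):
    """Expected average of f over cases under risk-biased sampling,
    where subject i becomes a case with probability proportional to r_i."""
    total = sum(risks)
    return sum(fi * ri / total for fi, ri in zip(f, risks))

def reweighted_case_mean(f, risks):
    """Case average after weighting the i-th case by mu / r_i."""
    mu = sum(risks) / len(risks)
    total = sum(risks)
    return sum((mu / ri) * fi * ri / total for fi, ri in zip(f, risks))
```

When larger values of `f` coincide with higher risks, `case_mean` exceeds `population_mean`, while `reweighted_case_mean` agrees with it. This is the sense in which quantities averaged over cases give extra weight to higher-risk strata.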
We evaluated the statistical significance of the hypothesis that heterogeneity exists in the data in the following way. We determined the value of D* from the optimal 2-class system and compared this with a reference distribution in which the sample labels were permuted 1000 times and D* recalculated for each permuted dataset. Permutation of the sample labels ensures that the genomic profiles are randomly paired with the risk factor profiles, defining the absence of a true signal. Determination of the correct number of sub-types is a challenge in any clustering context but it is especially challenging in this context. Here we chose to use the difference in the optimal D* values for the numbers of sub-types being compared, e.g. in determining whether 3 sub-types reveal significant additional heterogeneity relative to 2 sub-types we subtracted the optimal D* for the 2-class analysis from the optimal value for the 3-class analysis. We generated a reference distribution by permuting the sample labels, calculating the optimal 3-class and 2-class solutions, calculating the difference, and repeating the process 1000 times.
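The permutation procedure can be sketched generically as follows. The `statistic` callable is a hypothetical stand-in for the full pipeline (for example, the optimal D* for 2 classes, or the difference between the optimal 3-class and 2-class values). The add-one form of the p-value is an assumption made here, not something specified in the text.

```python
import random

def permutation_pvalue(genomic, risk_factors, statistic, n_perm=1000, seed=0):
    """Compare the observed statistic with its permutation distribution.

    statistic(genomic, risk_factors) returns a float; permuting
    risk_factors pairs the genomic profiles with risk profiles at random.
    """
    rng = random.Random(seed)
    observed = statistic(genomic, risk_factors)
    shuffled = list(risk_factors)
    n_exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(genomic, shuffled) >= observed:
            n_exceed += 1
    # Add-one convention (assumed) so the p-value is never exactly zero.
    return (n_exceed + 1) / (n_perm + 1)
```

Because the statistic is recomputed from scratch on every permuted dataset, the reference distribution reflects the same optimization over clusterings that produced the observed value.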
Our investigation is exploratory. Since genomic data are so voluminous and we have results from multiple platforms we approached the analysis with some specific questions in mind, to provide structure to our analysis and to enhance our confidence in any interesting observations. First we performed the preceding clustering analysis separately for each of 3 platforms: mRNA, methylation and copy number arrays. Then we attempted to address the following questions:- Do any of the identified sub-types possess a distinctive risk profile? Are any sub-types determined from mRNA, methylation or copy number data characterized by distinct mutational profiles or genetic pathways? Do the individual sub-types have distinctive clinical characteristics? Are the different genomic platforms congruent with respect to sub-types identified?
To address the involvement of genetic pathways a gene set enrichment analysis was conducted. We obtained a pre-defined collection of pathway gene sets from the Molecular Signatures Database (MSigDB v4.0) and the Database for Annotation, Visualization, and Integrated Discovery (DAVID). We conducted a gene set enrichment analysis for each of the sub-types for each of the platforms. Specifically, for each platform we first calculated a t-statistic for each gene j comparing samples in sub-type k (k = 1, …, 4) versus the remaining sub-types. Genes were ranked based on these scores. Then for each gene set S, a Wilcoxon rank sum test was used to compare the ranks of genes in the pathway (j ∈ S) versus their complement (j ∉ S). In this way we calculated a separate enrichment p-value for each pathway in each of the four sub-types. This can be considered a competitive test in the nomenclature of Goeman and Bühlmann, in that the Wilcoxon test statistic assesses whether the frequency of differential expression differs for pathway genes versus non-pathway genes [16].
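The competitive test described above can be sketched in a few steps. This is an illustration under stated assumptions: the t-statistic is taken to be Welch's unequal-variance form, the rank sum p-value uses a two-sided normal approximation without a tie correction, and every gene is assumed to vary within both groups. All function names are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(a, b):
    """Welch t-statistic comparing two groups of sample values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def rank_sum_pvalue(scores, in_set):
    """Two-sided Wilcoxon rank sum p-value (normal approximation)."""
    ranks = average_ranks(scores)
    n1 = sum(in_set)
    n2 = len(scores) - n1
    w = sum(r for r, s in zip(ranks, in_set) if s)
    mu = n1 * (n1 + n2 + 1) / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return math.erfc(abs((w - mu) / sigma) / math.sqrt(2))

def enrichment(expr, labels, subtype, gene_set):
    """Score each gene by sub-type versus rest, then compare the ranks of
    genes inside the set with those outside it.

    expr maps a gene name to per-sample values; labels holds each
    sample's sub-type.
    """
    genes = list(expr)
    scores = []
    for g in genes:
        inside = [v for v, l in zip(expr[g], labels) if l == subtype]
        outside = [v for v, l in zip(expr[g], labels) if l != subtype]
        scores.append(t_statistic(inside, outside))
    return rank_sum_pvalue(scores, [g in gene_set for g in genes])
```

Calling `enrichment` once per pathway and per sub-type yields the separate enrichment p-values. In practice a statistics library would normally supply the t-test and rank sum test, and the hand-written versions here only keep the sketch self-contained.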