Data
The ICARE study, conducted from 2001 to 2007, is a large multicentre population-based case-control study of respiratory cancers. The study was approved by the Institutional Review Board of the French National Institute of Health and Medical Research (IRB-Inserm, no. 01-036), and by the French Data Protection Authority (CNIL no. 90120). Each subject gave written informed consent. To protect the confidentiality of personal data and to fulfil legal requirements, the questionnaire included only an identification number, without any nominative information. The same identification number was used for biological specimens. The link between the name and the identification number (to the exclusion of any other data) was kept by the cancer registry of the area where the subject was interviewed. Further details have been described previously [13].
The present analysis focused on men with lung cancer and their population controls, restricting the dataset to 4,658 males with full smoking histories, of whom 1,995 are cases. We also separately consider the male dataset stratified by histological cell type. For the histological analyses, we use all the controls, but only the relevant cases, resulting in datasets of size 3,365 for adenocarcinoma (702 cases), 3,359 for squamous (696 cases) and 2,933 for small cell cancers (270 cases). The smoking covariates that we study are: intensity (cigarettes per day), duration (years as a smoker), time since smoking cessation (years) and pack-years. For each covariate we categorise the data into 5 categories (summarised in Table 1), chosen to contain approximately balanced numbers of individuals as well as being easily interpretable.
Table 1
Summary of covariate categories
Covariate | Category | Definition | Number of subjects |
Average intensity of smoking | 0 | Non-smoker | 823 |
| 1 | 0 < cigarettes per day ≤ 10 | 716 |
| 2 | 10 < cigarettes per day ≤ 20 | 1540 |
| 3 | 20 < cigarettes per day ≤ 30 | 1014 |
| 4 | 30 < cigarettes per day | 550 |
| NA | Not available | 15 |
Duration of smoking | 0 | Non-smoker | 823 |
| 1 | 0 < years ≤ 20 | 972 |
| 2 | 20 < years ≤ 30 | 887 |
| 3 | 30 < years ≤ 40 | 1073 |
| 4 | 40 < years | 903 |
Time since quit smoking | 0 | Non-smoker | 823 |
| 1 | 20 < years | 870 |
| 2 | 10 < years ≤ 20 | 583 |
| 3 | 0 < years ≤ 10 | 996 |
| 4 | Current smoker | 1386 |
Pack-years | 0 | Non-smoker | 823 |
| 1 | 0 < pack-years ≤ 15 | 1089 |
| 2 | 15 < pack-years ≤ 30 | 1043 |
| 3 | 30 < pack-years ≤ 45 | 888 |
| 4 | 45 < pack-years | 800 |
| NA | Not available | 15 |
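The categorisation in Table 1 can be sketched in a few lines of Python. This is only an illustration: the function and constant names are hypothetical, the cut-points are read directly from Table 1, and the coding of smoking status as "never", "former" or "current" is an assumption rather than the study's actual data format.

```python
from bisect import bisect_left
from typing import List, Optional

# Upper bounds of categories 1-3 in Table 1; larger values fall in category 4.
INTENSITY_BOUNDS = [10, 20, 30]   # cigarettes per day
DURATION_BOUNDS = [20, 30, 40]    # years as a smoker
PACK_YEAR_BOUNDS = [15, 30, 45]   # pack-years


def categorise(value: Optional[float], bounds: List[float]) -> Optional[int]:
    """Map an exposure to its Table 1 category (0-4), or None if not available."""
    if value is None:
        return None   # the "NA" row of Table 1
    if value <= 0:
        return 0      # non-smoker
    # bisect_left yields right-closed intervals: 10 -> category 1, 10.5 -> category 2
    return 1 + bisect_left(bounds, value)


def categorise_time_since_quitting(years: Optional[float], status: str) -> int:
    """Table 1 coding: a longer time since quitting maps to a lower category."""
    if status == "never":      # hypothetical status label
        return 0
    if status == "current":    # hypothetical status label
        return 4
    if years > 20:
        return 1
    if years > 10:
        return 2
    return 3
```

Note that the time-since-quitting scale runs in the opposite direction to the other three covariates, so that in every case a higher category corresponds to a heavier or more recent exposure.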
Within our model we adjust for age, education level, whether the subject has ever worked in a job known to entail exposures associated with lung cancer (i.e. List A [14]), and for the centre where the data were collected. These adjustments are made by treating the variables as fixed effects, as described in the statistical model section below.
Statistical background
In order to explore the associations between smoking characteristics (the covariates) and the risk of lung cancer (the outcome), most common methods attempt to perform a direct regression of the outcome against the covariates. In contrast, our proposed method uses an alternative approach, based upon a statistical mixture model designed to flexibly group individuals into clusters, allowing the clusters to be jointly determined by both covariates and outcomes. By then looking at typical cluster characteristics, in particular the probabilities of covariate values (which we call the profile) for any particular cluster, alongside the average risk of disease for that cluster, we can draw conclusions about patterns within the profile that appear to be related to increased or decreased risk.
As a specific simplified example of how such a model might be used, suppose we fit the model to a subset of the smoking covariates (intensity, duration and time since cessation). In the resulting analysis, imagine that the subjects were split into three clusters. Suppose cluster 1 is identified as having a high risk for the disease, cluster 2 contains subjects at average risk and cluster 3 consists of subjects at low risk. By looking at the average profile in the high risk cluster 1, we might see for example a higher than average probability of being in the highest intensity category, the longest duration category and a raised probability of being a current smoker. Of course, if the method resulted only in such simplified results, this would provide no insight beyond the well known harmful effects of tobacco smoke, but in practice we might hope for a larger number of clusters, covering a range of disease risks, each with different profiles, allowing us to tease out more subtle relationships between covariate combinations and risk.
The underlying clustering model that we use is based on a Dirichlet process (DP) formulation, a well recognized semi-parametric technique that has been extensively studied [15, 16] and which can be implemented using a Markov chain Monte Carlo (MCMC) algorithm. To formalise the ideas behind the method we employ, consider that we have $N$ individuals, indexed by $i$. For each individual we have an observed disease outcome $y_i$ and a covariate profile $x_i = (x_{i,1}, \ldots, x_{i,J})$, consisting of the $J$ covariates that we are interested in studying, where covariate $j$ takes one of $L_j$ possible categories.
The model that we adopt is a joint probability model for the outcome $y_i$ and profile $x_i$, where for each individual, independent of every other,

$$p(y_i, x_i \mid \Theta) = \sum_{c=1}^{\infty} \psi_c \, p(y_i \mid \Theta_c, \Theta_0) \, p(x_i \mid \Theta_c, \Theta_0). \qquad (1)$$

This describes an infinite mixture model, where the weight of mixture component $c$ is given by $\psi_c$, and, for each component, the probability models for the outcome $y_i$ and the profile $x_i$ are independent, conditional on some component specific parameters $\Theta_c$ and some global parameters $\Theta_0$. On the left hand side we summarise the complete set of parameters as $\Theta = (\Theta_0, \psi_1, \Theta_1, \psi_2, \Theta_2, \ldots)$. In order to make inference, it is convenient to introduce the additional allocation parameter $Z_i$, with the interpretation that $Z_i = c$ indicates that individual $i$ is assigned to mixture component $c$. If the prior allocation probabilities are given by $p(Z_i = c) = \psi_c$, posterior inference on $Z = (Z_1, Z_2, \ldots, Z_N)$ then provides us with information on the groupings, or clustering, of the individuals.
The mixture weights $\psi = \{\psi_c, c \geq 1\}$ are modelled according to a “stick breaking” representation [17] of a Dirichlet process prior using the following construction. We define a series of independent random variables $V_j$, each having distribution $V_j \sim \mathrm{Beta}(1, \alpha)$, and set $\psi_1 = V_1$ and $\psi_c = V_c \prod_{j<c} (1 - V_j)$ for $c \geq 2$. This generative process is referred to as a stick-breaking formulation since one can think of $V_1$ as representing the breakage of a stick of length 1, leaving a remainder of $(1 - V_1)$, and then a proportion $V_2$ being broken off, leaving $(1 - V_1)(1 - V_2)$, etc. More details about this construction are given in Additional file 1: Appendix 1 in the supplemental material.
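The stick-breaking construction can be illustrated with a short Python sketch. It is a minimal illustration rather than the sampler used in the analysis: the function name is hypothetical, the value of alpha is arbitrary, and the stick is broken only a fixed number of times for display purposes.

```python
import random


def stick_breaking_weights(alpha, n_components, rng):
    """Return the first n_components mixture weights and the unbroken remainder."""
    weights = []
    remaining = 1.0                        # length of stick not yet broken off
    for _ in range(n_components):
        v = rng.betavariate(1.0, alpha)    # V ~ Beta(1, alpha)
        weights.append(v * remaining)      # psi_c = V_c * prod_{j<c} (1 - V_j)
        remaining *= 1.0 - v
    return weights, remaining


rng = random.Random(0)
weights, remainder = stick_breaking_weights(alpha=1.0, n_components=20, rng=rng)
```

The weights and the remainder always sum to one. Smaller values of alpha tend to break off larger pieces early, concentrating the mass on a few components, whereas larger values spread it over many.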
The flexibility of this model is provided by the choices for the response sub-model $p(y_i \mid \Theta_c, \Theta_0)$ and the profile sub-model $p(x_i \mid \Theta_c, \Theta_0)$. For the response sub-model, we assume $y_i \mid Z_i = c \sim \mathrm{Bernoulli}(p_i)$, where $\mathrm{logit}(p_i) = \theta_c + \beta^{\top} w_i$. Here, $\theta_c$ is the log odds of disease for component $c$ and $w_i$ are additionally observed fixed effects covariates or confounders for individual $i$, with regression coefficients $\beta$ that do not depend upon the mixture component to which individual $i$ is allocated.
For the profile sub-model, conditional upon the allocation $Z_i$, we assume independence between covariates, such that $p(x_i \mid Z_i = c, \Theta_c) = \prod_{j=1}^{J} \phi_{c,j}(x_{i,j})$, where $\phi_{c,j}$ is the vector of probabilities associated with cluster $c$ for each of the $L_j$ possible categories that could be observed for covariate $j$, and $\phi_{c,j}(x)$ denotes its entry for category $x$.
Together these two sub-models define our component specific parameters $\Theta_c = (\theta_c, \phi_{c,1}, \ldots, \phi_{c,J})$ and the global parameters $\Theta_0 = \beta$.
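To make the roles of the two sub-models concrete, the following Python sketch evaluates each of them for a single individual and combines them with the mixture weights to give the probability of allocation to each component. It assumes a Bernoulli response with a logit link and a finite list of components. All function names and numerical values are hypothetical and chosen purely for illustration.

```python
import math


def response_prob(y, theta_c, beta, w):
    """Bernoulli probability of outcome y, with logit(p) = theta_c + beta . w."""
    eta = theta_c + sum(b * w_k for b, w_k in zip(beta, w))
    p = 1.0 / (1.0 + math.exp(-eta))
    return p if y == 1 else 1.0 - p


def profile_prob(x, phi_c):
    """Product over covariates j of the category probability phi_c[j][x[j]]."""
    return math.prod(phi_c[j][x_j] for j, x_j in enumerate(x))


def allocation_probs(y, x, w, psi, theta, phi, beta):
    """p(Z_i = c | y_i, x_i, parameters), normalised over the listed components."""
    unnorm = [
        psi[c] * response_prob(y, theta[c], beta, w) * profile_prob(x, phi[c])
        for c in range(len(psi))
    ]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Two components, two binary covariates, one fixed effect (illustrative values).
psi = [0.6, 0.4]
theta = [-1.0, 1.5]                      # component log odds
phi = [
    [[0.8, 0.2], [0.7, 0.3]],            # component 0: mostly category 0
    [[0.2, 0.8], [0.3, 0.7]],            # component 1: mostly category 1
]
beta = [0.1]
probs = allocation_probs(y=1, x=[1, 1], w=[1.0], psi=psi, theta=theta, phi=phi, beta=beta)
```

In this toy example a case whose covariates both fall in category 1 is allocated to component 1 with high probability, because both the outcome and the profile favour that component.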
Adopting a Bayesian perspective allows a natural way of making joint inference on the full set of parameters. Such an approach requires further specification of prior distributions for these parameters. We adopt similar priors to those used by Molitor et al. [10], using a conjugate approach where possible. A full specification can be found in Additional file 1: Appendix 1.
Inference
Because the posterior distribution resulting from these priors and the likelihood in model (1) is non-standard, we use a simulation based method and an MCMC sampler to make inference. Contrary to standard practice, whereby a truncated version of model (1) [17–20] is typically considered, the new sampler that we use here (Hastie DI, Liverani S, Papathomas M, Richardson S: PReMiuM, An R package for Profile Regression Mixture Models using Dirichlet Processes, submitted) does not require any truncation a priori. Instead it relies on the introduction of a latent variable which allows a finite number of clusters to be sampled within each iteration of the sampler, as specified for previous samplers of a similar nature [21–23]. This sampler uses a combination of Gibbs and Metropolis-within-Gibbs steps to sample from the infinite mixture, retaining at each sweep only the parameters of a finite number of mixture components, including all those to which individuals are allocated. If there are missing values in the profile data, these can also be sampled within the MCMC sampler.
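The following Python sketch illustrates one common way in which a latent variable can make an infinite mixture tractable without truncation, namely a slice-sampling style argument. Whether this matches the details of the sampler used here is an assumption; the function names and values are hypothetical.

```python
import random


def components_needed(u_min, alpha, rng):
    """Break the stick until the unbroken remainder is smaller than u_min (> 0)."""
    weights, remaining = [], 1.0
    while remaining >= u_min:
        v = rng.betavariate(1.0, alpha)    # V ~ Beta(1, alpha)
        weights.append(v * remaining)
        remaining *= 1.0 - v
    return weights


def candidate_components(u_i, weights):
    """Components that individual i may be allocated to, given latent u_i."""
    return [c for c, weight in enumerate(weights) if weight > u_i]


rng = random.Random(1)
latent_u = [0.05, 0.20, 0.10]              # one latent uniform draw per individual
weights = components_needed(min(latent_u), alpha=1.0, rng=rng)
candidates = [candidate_components(u, weights) for u in latent_u]
```

Every component not yet generated has weight below the remaining stick length, and therefore below every latent value, so only the finitely many generated components need to be considered at that iteration.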
Post-processing
One way to summarise the characteristics of the posterior clustering from an MCMC run is to perform several post-processing steps [10]. In brief, a dissimilarity matrix is constructed that records, for each pair of individuals, the proportion of the MCMC iterations in which they were allocated to different mixture components. Partitioning around medoids (PAM) [24], or an alternative based on square error distance [25], is then applied to this dissimilarity matrix to determine a representative clustering. Using this representative clustering, the characteristics of its clusters are obtained by examining the MCMC output for the relevant parameters [10].
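The construction of the dissimilarity matrix can be sketched as follows. The function name and the toy allocations are hypothetical; the input is assumed to be a list with one entry per MCMC iteration, each holding the component label of every individual at that iteration.

```python
def dissimilarity_matrix(allocations):
    """Proportion of iterations in which each pair sits in different components.

    allocations[s][i] is the component label of individual i at iteration s.
    """
    n_iter = len(allocations)
    n = len(allocations[0])
    counts = [[0] * n for _ in range(n)]
    for z in allocations:
        for i in range(n):
            for k in range(i + 1, n):
                if z[i] != z[k]:
                    counts[i][k] += 1
                    counts[k][i] += 1
    return [[count / n_iter for count in row] for row in counts]


# Four iterations, three individuals (toy values).
allocations = [
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 0],
    [1, 1, 1],
]
dissimilarity = dissimilarity_matrix(allocations)
```

Because only co-allocation matters, the matrix is unaffected by the arbitrary relabelling of components between iterations. It can then be passed to a clustering routine such as PAM.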
Any such representation of the rich output of the DP is necessarily reductive and should not be over-interpreted, as it is linked to the chosen way of post-processing the dissimilarity matrix. Nevertheless, in our case study, we found that it provides a useful representation for understanding which dimension of exposure drives the risk.
Quantifying patterns
Examining the typical profiles of clusters associated with different levels of risk can provide a hypothesis-generating descriptive exploration of potential associations between covariates and link these to the outcome. However, it is also of interest to quantify the roles of specific covariates. Fortunately, with little extra effort, our simulation based method allows such results to be derived, through the use of posterior predictions.
Suppose that we wish to understand the role of a particular covariate or group of covariates. We can specify a number of predictive scenarios (pseudo-profiles) that capture the range of possibilities for the covariates that we are interested in. For each of these pseudo-profiles we can see how it would have been allocated in our mixture model, in order to understand the risk associated with that profile. More details on the pseudo-profiles are available in Additional file 1: Appendix 3.
To illustrate, consider our simple example above, where the smoking covariates under study are intensity, duration and time since cessation. Suppose further that we have a simplified categorical structure for each variable, with each individual being categorised into 0=non-smoker, 1=Low, 2=Medium or 3=High for each of these covariates. If we are particularly interested in how intensity affects the risk, we can set up the following pseudo-profiles for $(x_{\mathrm{INT}}, x_{\mathrm{DUR}}, x_{\mathrm{TSC}})$: the non-smoker (0,0,0), the low intensity smoker (1,NA,NA), the medium intensity smoker (2,NA,NA) and the high intensity smoker (3,NA,NA). The non-smoking pseudo-profile is included for reference, so that we can compute the odds ratio (OR) with respect to this profile for each of the other pseudo-profiles. Notice that for the intensity profiles, the other variables (duration and time since cessation) are treated as missing (denoted by NA). We discuss the technicalities of this in Additional file 1: Appendix 4.
As an output of our method, for each of our non-smoker and low, medium and high intensity pseudo-profiles, we can compute the probabilities that the pseudo-profile belongs in each cluster. These probabilities do not affect the fit of the model, which is determined wholly by the observed data. However, with these probabilities we can construct a cluster-averaged estimate of the log odds for each particular pseudo-profile. This is repeated at each stage of our model fitting process resulting in a density of these log odds (or the log odds ratio with respect to the non-smoking reference pseudo-profile) that gives us an estimate of the effect of the particular pseudo-profile. This can be compared to other pseudo-profiles, allowing us to derive a better understanding of the role of specific covariates.
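A minimal Python sketch of this calculation is given below. It rests on several assumptions that should be checked against the Additional file: allocation probabilities for a pseudo-profile are taken to be proportional to the mixture weight times the profile probabilities of the observed covariates only, covariates marked NA (coded as None) are simply left out of the product, and fixed effects are held at their baseline so that the log odds of cluster c is theta_c. All names and numerical values are hypothetical.

```python
def pseudo_profile_log_odds(profile, psi, theta, phi):
    """Cluster-averaged log odds of one pseudo-profile at one MCMC iteration."""
    unnorm = []
    for c in range(len(psi)):
        weight = psi[c]
        for j, x_j in enumerate(profile):
            if x_j is not None:            # NA covariates are left out (assumption)
                weight *= phi[c][j][x_j]
        unnorm.append(weight)
    total = sum(unnorm)
    return sum(u / total * theta[c] for c, u in enumerate(unnorm))


def log_odds_ratios(profile, reference, draws):
    """One log odds ratio per MCMC draw, relative to the reference profile."""
    return [
        pseudo_profile_log_odds(profile, *draw)
        - pseudo_profile_log_odds(reference, *draw)
        for draw in draws
    ]


# One toy draw: two clusters, three covariates, each with categories 0-3.
psi = [0.5, 0.5]
theta = [-1.0, 1.0]                        # cluster log odds
phi = [
    [[0.7, 0.1, 0.1, 0.1]] * 3,            # low-risk cluster: mostly non-smokers
    [[0.1, 0.1, 0.2, 0.6]] * 3,            # high-risk cluster: mostly category 3
]
draws = [(psi, theta, phi)]
log_ors = log_odds_ratios((3, None, None), (0, 0, 0), draws)
```

With real MCMC output, `draws` would hold one set of parameters per retained iteration, and the resulting list of log odds ratios would approximate the posterior density described above.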