Data for this analysis were drawn from the Nicotine Dependence in Teens (NDIT) Study [10], an ongoing prospective cohort investigation of 1,293 students initially aged 12-13 years recruited from grade 7 classes in a convenience sample of 10 secondary schools in Montreal, Canada. The primary objective of NDIT is to describe the natural course of nicotine dependence (ND) in relation to cigarette smoking. Over half (55.4%) of eligible students participated; the low response related, in part, to a labour dispute that resulted in some teachers refusing to collect consent forms. Participants provided assent and parents/guardians provided signed informed consent. Questionnaire data were collected every 3 months during the 10-month school year over a 5-year follow-up period until participants completed secondary school, for a total of 20 cycles [11]. The study received ethics approval from the Montreal Department of Public Health Ethics Review Committee, the McGill University Faculty of Medicine Institutional Review Board and the Ethics Review Committee at the CRCHUM.
Study variables
Time of initiation of daily cigarette smoking was identified using data collected in a past 3-month recall of cigarette use [12] completed in each cycle. The recall included one item for each of the three months preceding questionnaire administration that measured the number of days on which the participant had smoked cigarettes during that month, and one item for each month that measured the average number of cigarettes smoked per day during that month. Three-month test-retest reliability for these two items was very good [13]. Participants who reported smoking cigarettes on all 30 days in any of the three months covered in a cycle were categorized as daily cigarette smokers (as of that cycle). Initiation of daily smoking was considered to have occurred during the cycle in which the participant reported smoking daily for the first time.
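The classification rule above can be illustrated with a minimal sketch. The function names, the data layout (one list of three monthly day counts per cycle) and the handling of missing months are assumptions made for illustration only; they are not taken from the NDIT data files.

```python
from typing import Dict, Optional, Sequence

DAYS_FOR_DAILY = 30  # "all 30 days" in a recalled month

def is_daily_smoker(days_smoked: Sequence[Optional[int]]) -> bool:
    """True if any of the three recalled months has smoking on all 30 days.

    Assumption: a missing month (None) is treated as not meeting the criterion.
    """
    return any(d is not None and d >= DAYS_FOR_DAILY for d in days_smoked)

def initiation_cycle(recall_by_cycle: Dict[int, Sequence[Optional[int]]]) -> Optional[int]:
    """Return the first cycle in which daily smoking is reported, or None."""
    for cycle in sorted(recall_by_cycle):
        if is_daily_smoker(recall_by_cycle[cycle]):
            return cycle
    return None

# Hypothetical participant: days smoked in each of the three recalled months, per cycle.
example_recall = {1: [0, 0, 2], 2: [5, 12, 20], 3: [28, 30, 30], 4: [30, 30, 30]}
first_daily_cycle = initiation_cycle(example_recall)
print(first_daily_cycle)
```

In this hypothetical example the participant first reports a 30-day month in cycle 3, so cycle 3 is taken as the cycle of initiation.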
Seven prognostic indicators were selected based on their association with the initiation of daily smoking, as previously assessed in the NDIT cohort [10], and on the feasibility of collecting accurate data from youth in a clinical setting, as indicated by features such as clarity and simplicity of the question to be asked, and ease and rapidity of assessment. Specifically, these included sex, lifetime smoking history (ever, never), ever felt like you really need a cigarette (no, yes), parent(s) smoke (no, yes), sibling(s) smoke (no, yes), friends smoke (no, yes), and alcohol use (never, yes).
Lifetime smoking history was measured in two items: (i) "Have you ever IN YOUR LIFE smoked a cigarette, even just a puff (drag, hit, haul)?" Response choices included no; yes, 1 or 2 times; yes, 3 or 4 times; yes, 5-10 times; and yes, more than 10 times; and (ii) "During the past 3 months, how often did you smoke a cigar or cigarillo?" Response options included never, a bit to try, once or a couple of times a month, once or a couple of times a week, and every day. Participants were categorized as an "ever smoker" if they had a positive response to either item.
"Need a cigarette" was measured in a single item: "How often have you felt like you really need a cigarette?" The four response choices included never, rarely, sometimes, and often. For analysis, responses were recoded into no (never) and yes (rarely, sometimes, often).
Parental smoking was measured by: "Does your father currently smoke cigarettes?" and "Does your mother currently smoke cigarettes?" with response options including no and yes (for each parent). For analysis, a new variable, "parent smoking", was created with response options including no (neither parent smoked) and yes (one or both parents smoked).
"Sibling smoking" was measured by "You have □ sisters who smoke cigarettes" and "You have □ brothers who smoke cigarettes". Participants were instructed to write the number of sisters/brothers who smoke in the box. If they had no sisters/brothers who smoked, they were instructed to write 0 in the box. For analysis, responses were recoded to no (no sibling smokes) and yes (one or more siblings smoke).
"Friends smoking" was measured by "Now think about your friends. How many of the people whom you usually hang out with smoke cigarettes?" The five response options included none, a few, about half, more than half, and most or all. For analysis, responses were recoded into no (none) and yes (a few, about half, more than half, most or all).
"Alcohol use" was measured by "During the past 3 months, how often did you drink alcohol?" Response options included never, a bit to try, once or a couple of times a month, once or a couple of times a week, and every day. For analysis, responses were recoded into no (never) and yes (a bit to try, once or a couple of times a month, once or a couple of times a week, and every day).
Mother's education was measured by presenting the respondents with the following five response options: did not finish high school, high school graduate, vocational, technical school, CEGEP, and university.
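The recoding rules described above for the binary prognostic indicators can be sketched as follows. The function names and the treatment of missing responses (returned as None so that they can later be imputed) are assumptions for illustration, not details reported by the study.

```python
from typing import Optional

def recode_never_vs_any(response: Optional[str], no_value: str = "never") -> Optional[str]:
    """Recode a multi-level item into no/yes; `no_value` is the single 'no' category."""
    if response is None:
        return None  # assumption: missing stays missing
    return "no" if response == no_value else "yes"

def recode_parent_smoking(father: Optional[str], mother: Optional[str]) -> Optional[str]:
    """yes if one or both parents smoke, no if neither smokes."""
    if "yes" in (father, mother):
        return "yes"
    if father == "no" and mother == "no":
        return "no"
    return None  # assumption: undetermined when a 'no' is paired with a missing answer

def recode_sibling_smoking(sisters: Optional[int], brothers: Optional[int]) -> Optional[str]:
    """yes if one or more siblings smoke, no otherwise."""
    if sisters is None and brothers is None:
        return None
    return "yes" if (sisters or 0) + (brothers or 0) > 0 else "no"

def recode_ever_smoker(cigarette_item: Optional[str], cigar_item: Optional[str]) -> Optional[str]:
    """'ever' if either the lifetime cigarette item or the cigar item is positive."""
    answers = [recode_never_vs_any(cigarette_item, "no"),
               recode_never_vs_any(cigar_item, "never")]
    if "yes" in answers:
        return "ever"
    return None if None in answers else "never"

print(recode_never_vs_any("rarely"))                     # "need a cigarette"
print(recode_never_vs_any("a few", no_value="none"))     # friends smoking
print(recode_never_vs_any("a bit to try"))               # alcohol use
```

The same helper serves the "need a cigarette", friends smoking and alcohol use items because each has exactly one response category that maps to "no".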
Data analysis
The database to study the 1-year risk of daily smoking was created in five steps: (i) observations were divided into four consecutive 1-year waves, each including five data collection cycles (i.e., cycles 1-5, 5-9, 9-13 and 13-17); (ii) we determined whether participants had initiated daily smoking within each 1-year wave; (iii) if a participant had already been categorized as a daily smoker at the beginning of a 1-year wave, he or she was removed from that wave and all subsequent 1-year waves; (iv) data on the covariates were drawn from the "baseline" cycle within each wave; and (v) data for all participants for all 1-year waves, up to and including the cycle in which participants initiated daily smoking or follow-up ended, were pooled across participants and waves. We used multiple imputation to deal with missing values of the covariates. Specifically, we carried out multiple imputation by chained equations with Gibbs sampling using the MICE package available in R [14]. Twenty-five imputation models were run, which included daily smoking, the covariates representing the seven prognostic indicators, and mother's education, which was included as an indicator of socio-economic status because it may be an important determinant of non-response and/or other sources of missingness.
A second database was created to compute the 2-year risk of becoming a daily smoker by subdividing observations into two consecutive 2-year waves, which each included nine cycles (i.e., 1-9, 9-17). The steps to create this second database were analogous to those described above.
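The construction of the pooled wave-level databases can be sketched as below. This is a simplified illustration under stated assumptions: the input layout, the function name `build_wave_records` and the field names are hypothetical, and loss to follow-up before the end of a wave is not handled.

```python
from typing import Dict, List, Optional

def build_wave_records(initiation: Dict[str, Optional[int]],
                       covariates: Dict[str, Dict[int, dict]],
                       span: int,
                       last_cycle: int = 17) -> List[dict]:
    """Pool one record per participant per wave.

    span=4 gives the 1-year waves (cycles 1-5, 5-9, 9-13, 13-17);
    span=8 gives the 2-year waves (cycles 1-9, 9-17).
    `initiation` maps participant id to the cycle of daily smoking initiation (or None).
    """
    records = []
    for pid, init_cycle in initiation.items():
        for start in range(1, last_cycle, span):
            end = start + span
            # Step (iii): already a daily smoker at baseline, so drop this and later waves.
            if init_cycle is not None and init_cycle <= start:
                break
            # Step (ii): did initiation occur within this wave?
            initiated = init_cycle is not None and start < init_cycle <= end
            # Step (iv): covariates are taken from the baseline cycle of the wave.
            baseline = covariates.get(pid, {}).get(start, {})
            records.append({"id": pid, "wave": (start, end),
                            "initiated": initiated, **baseline})
    return records

# Hypothetical participants: never initiates, initiates at cycle 7, daily smoker at cycle 1.
initiation = {"a": None, "b": 7, "c": 1}
covariates = {"a": {1: {"sex": "F"}}, "b": {1: {"sex": "M"}, 5: {"sex": "M"}}}
one_year = build_wave_records(initiation, covariates, span=4)
two_year = build_wave_records(initiation, covariates, span=8)
print(len(one_year), len(two_year))
```

Participant "b" contributes two 1-year records (the second carrying the outcome) but only one 2-year record, which shows why the two databases differ in size.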
Multivariable logistic regression analyses were used to estimate regression parameters, as well as statistics and indicators assessing the model goodness-of-fit and predictive ability. Separate models were fitted for the 1- and 2-year risk analyses. The dependent variable was the indicator of initiation of daily smoking over the relevant risk period, and the independent variables were the seven prognostic indicators. We tested potential interactions between the independent variables by adding pair-wise product terms to the "main effects only" model to check whether any given product term necessitated inclusion. None was found to be statistically significant, so the "main effects only" models were retained as the final models. Specific patterns of missingness are described in Tables 1 and 2.
Table 1
Patterns of missingness in the 1-year risk analysis.
2673 | + | + | + | + | + | + | + | + | + | + |
51 | + | + | + | + | + | + | + | - | + | + |
101 | + | + | + | + | + | + | + | + | - | + |
7 | + | + | + | + | - | + | + | + | + | + |
19 | + | + | + | + | + | - | + | + | + | + |
29 | + | + | + | + | + | + | - | + | + | + |
2 | + | + | + | - | + | + | + | + | + | + |
499 | + | + | + | + | + | + | + | + | + | - |
6 | + | + | + | + | + | + | + | - | - | + |
1 | + | + | + | - | + | - | + | + | - | + |
2 | + | + | + | + | + | + | - | + | - | + |
1 | + | + | + | + | + | + | + | + | - | + |
17 | + | + | + | + | + | + | + | - | + | - |
27 | + | + | + | + | + | + | + | + | - | - |
1 | + | + | + | + | - | + | + | + | + | - |
1 | + | + | + | + | + | - | + | + | + | - |
6 | + | + | + | + | + | + | - | + | + | - |
2 | + | + | + | - | + | + | + | + | + | - |
1 | + | + | + | + | + | - | + | - | + | - |
1 | + | + | + | + | + | - | + | + | - | - |
1 | + | + | + | - | + | + | + | + | - | - |
6 | + | + | + | + | - | - | - | - | - | + |
1 | + | + | - | + | - | - | - | - | - | + |
9 | + | + | + | + | - | - | - | - | - | - |
3 | + | + | - | + | - | - | - | - | - | - |
Table 2
Patterns of missingness in the 2-year risk analysis.
1238 | + | + | + | + | + | + | + | + | + | + |
29 | + | + | + | + | + | + | + | - | + | + |
13 | + | + | + | + | + | + | + | + | - | + |
7 | + | + | + | + | - | + | + | + | + | + |
7 | + | + | + | + | + | - | + | + | + | + |
13 | + | + | + | + | + | + | - | + | + | + |
3 | + | + | + | - | + | + | + | + | + | + |
219 | + | + | + | + | + | + | + | + | + | - |
1 | + | + | + | + | + | + | + | - | - | + |
8 | + | + | + | + | + | + | + | - | + | - |
4 | + | + | + | + | + | + | + | + | - | - |
1 | + | + | + | + | - | + | + | + | + | - |
3 | + | + | + | + | + | + | - | + | + | - |
14 | + | + | + | + | - | - | - | - | - | + |
1 | + | + | + | - | - | - | - | - | - | + |
6 | + | + | + | + | - | - | - | - | - | - |
3 | + | + | + | - | - | - | - | - | - | - |
Potential model overfitting (which could result in the prognostic indicators appearing more discriminating than they actually are) was addressed in bootstrap-based cross-validation (relying on 10,000 replication samples with replacement taken from the analytic dataset) [15]. This allowed us to correct the overfitting bias by applying correction factors (i.e. "shrinkage") to the regression coefficients estimated by the "naïve" logistic models so as to derive their bias-corrected counterparts [16, 17]. Specifically, this was carried out as follows. For each of the 10,000 bootstrap samples, the logistic regression model was fitted, producing 10,000 sets of estimated regression coefficients. These were then combined with realizations of the corresponding prognostic indicators to produce 10,000 linear predictor values. Next, logistic regressions were fitted with the linear predictor serving as the only independent variable, producing 10,000 sets of estimated regression coefficients: B0 (i.e. the intercept) and B1 (i.e. the slope). The 10,000 slope values were then averaged to produce the value of the "shrinkage" factor. The overfitting-corrected regression coefficients were obtained by multiplying the regression slope coefficients from the "naïve" model by the "shrinkage" factor.
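The bootstrap shrinkage procedure described above can be sketched in plain Python. Several points are assumptions rather than details given in the text: the linear predictor from each bootstrap model is evaluated on the original sample, the intercept is left uncorrected, the data are simulated with two hypothetical binary indicators, and only 50 replications are used (instead of 10,000) to keep the example fast.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def linear_predictor(beta, x):
    """Intercept plus the weighted sum of the indicators."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def fit_logistic(xs, ys, max_iter=25, tol=1e-8):
    """Maximum-likelihood logistic fit by Newton-Raphson; returns [intercept, slopes...]."""
    p = len(xs[0]) + 1
    beta = [0.0] * p
    for _ in range(max_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for x, y in zip(xs, ys):
            row = [1.0] + list(x)
            eta = max(-30.0, min(30.0, linear_predictor(beta, x)))  # guard against overflow
            mu = 1.0 / (1.0 + math.exp(-eta))
            for j in range(p):
                grad[j] += (y - mu) * row[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * row[j] * row[k]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

def bootstrap_shrinkage(xs, ys, n_boot, rng):
    """Average calibration slope (B1) over bootstrap replications."""
    n = len(ys)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        beta_b = fit_logistic([xs[i] for i in idx], [ys[i] for i in idx])
        # Assumption: the bootstrap model's linear predictor is evaluated on the original data.
        lp = [[linear_predictor(beta_b, x)] for x in xs]
        slopes.append(fit_logistic(lp, ys)[1])  # B1, the slope on the linear predictor
    return sum(slopes) / len(slopes)

# Simulated data with two hypothetical binary indicators (not NDIT data).
rng = random.Random(0)
xs = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(300)]
ys = [int(rng.random() < 1 / (1 + math.exp(-(-1.5 + 1.2 * a + 1.0 * b)))) for a, b in xs]

naive = fit_logistic(xs, ys)
shrinkage = bootstrap_shrinkage(xs, ys, n_boot=50, rng=rng)
# Only the slope coefficients are multiplied by the shrinkage factor.
corrected = [naive[0]] + [shrinkage * b for b in naive[1:]]
print(shrinkage, corrected)
```

A shrinkage factor below 1 pulls the slope coefficients toward zero, which is the intended correction for a model that fits its own sample better than it would fit new data.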
We assessed goodness-of-fit of the overfitting-corrected logistic models by comparing the observed versus expected numbers of outcome events within risk strata, and by carrying out the Hosmer-Lemeshow test [18]. Further, we examined the models' predictive ability by calculating the maximum-rescaled R² [19] and the c-statistic [20]. Finally, we assessed the degree of discriminating informativeness of the fitted logistic models (i.e. the extent to which the models are able to risk-stratify) as follows. First, the variance in outcome event probability estimates that would be provided by a hypothetical perfect regression model was calculated as the variance of the distribution of the actual outcome events in the study sample (because a perfect model would produce probability estimates of 0 for all individuals who would not experience the outcome event during the risk period and probability estimates of 1 for those who would). Second, the variance in the outcome event probability estimates provided by the actual fitted models was estimated. The ratio of the latter estimate of variance to the former provides a measure of the discriminating informativeness of the actual fitted model relative to the hypothetical perfect model. This measure ranges between 0 and 1, with a ratio of 0 corresponding to a totally non-informative model and a ratio of 1 corresponding to a perfect model.
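The discriminating informativeness ratio described above reduces to a ratio of two variances, as in the following minimal sketch. The function name and the toy values are hypothetical; population variances are used for both terms, which is an assumption (the ratio is the same as long as both variances use the same denominator).

```python
from statistics import pvariance
from typing import Sequence

def discriminating_informativeness(predicted: Sequence[float],
                                   outcomes: Sequence[int]) -> float:
    """Variance of fitted probabilities divided by variance of the observed 0/1 outcomes."""
    perfect_model_variance = pvariance([float(y) for y in outcomes])
    fitted_model_variance = pvariance(list(predicted))
    return fitted_model_variance / perfect_model_variance

# Toy example: four individuals, one of whom experiences the outcome event.
outcomes = [0, 0, 0, 1]
fitted = [0.1, 0.2, 0.2, 0.5]
ratio = discriminating_informativeness(fitted, outcomes)
ratio_perfect = discriminating_informativeness([0.0, 0.0, 0.0, 1.0], outcomes)
ratio_null = discriminating_informativeness([0.25, 0.25, 0.25, 0.25], outcomes)
print(round(ratio, 4), ratio_perfect, ratio_null)
```

The two boundary cases match the interpretation in the text: a model that assigns everyone the overall event rate scores 0, and a model that reproduces the outcomes exactly scores 1.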
The regression coefficients estimated in the overfitting-corrected logistic regression models were converted into user-friendly tables, to facilitate their application in practice. All analyses were conducted using SAS v9.13.