Background
Data and methods
Description of the data
| Variable | Category | Number of observations | % |
|---|---|---|---|
| Smoking status | Smoker | 51696 | 27.9 |
| | Non-smoker | 133923 | 72.1 |
| Age | 18-29 | 33450 | 18.0 |
| | 30-39 | 38660 | 20.8 |
| | 40-49 | 44193 | 23.8 |
| | 50-59 | 35646 | 19.2 |
| | 60-69 | 33670 | 18.2 |
| Sex | Male | 91160 | 49.1 |
| | Female | 94459 | 50.9 |
| Marital status | Married | 112853 | 60.8 |
| | Single | 57959 | 31.2 |
| | Widowed or divorced | 14807 | 8.0 |
| Education | University or higher | 24700 | 13.3 |
| | High school | 82324 | 44.4 |
| | Middle school | 58063 | 31.3 |
| | Primary school or less | 20532 | 11.0 |
| Income | High | 87989 | 47.4 |
| | Medium | 74159 | 40.0 |
| | Low | 23471 | 12.6 |
| Work status | Works | 107648 | 58.0 |
| | Does not work | 77971 | 42.0 |
| Region | North | 92741 | 50.0 |
| | Central | 45142 | 24.3 |
| | South | 47736 | 25.7 |
| Physical activity | Active | 61357 | 33.1 |
| | Partially active | 70800 | 38.1 |
| | Sedentary | 53462 | 28.8 |
| Alcohol consumption | High risk drinker | 18852 | 10.2 |
| | Low risk drinker | 52838 | 28.4 |
| | Non-drinker | 113929 | 61.4 |
| Depression status | Not depressed | 173416 | 93.4 |
| | Depressed | 12203 | 6.6 |
| Year | 2008 | 37205 | 20.0 |
| | 2009 | 38690 | 20.8 |
| | 2010 | 35896 | 19.2 |
| | 2011 | 36825 | 19.8 |
| | 2012 | 37003 | 20.0 |
Methods
Model and estimation method
Application: fitting the time VCM
Results
Smoking time-varying coefficient model
| Model | Description | Time (min) | p-value of test | H0 used for test | AIC | df |
|---|---|---|---|---|---|---|
| Selection of variables that have varying coefficients | | | | | | |
| LM | logistic model | <1 | - | - | 206590.4 | 21.00 |
| Model age | LM + s(t):age | 4.7 | <0.001 | LM | 206554.4 | 28.88 |
| Model alcohol | LM + s(t):alcohol use | 2.1 | <0.001 | LM | 206566.0 | 27.39 |
| Model physical | LM + s(t):physical activity | 2.1 | <0.001 | LM | 206570.1 | 26.19 |
| Model income | LM + s(t):income | 2.2 | <0.001 | LM | 206574.8 | 26.33 |
| Model mstatus | LM + s(t):marital status | 2.1 | <0.001 | LM | 206573.8 | 26.50 |
| Model edu | LM + s(t):education | 3.1 | <0.001 | LM | 206573.2 | 26.26 |
| Model sex | LM + s(t):sex | 1.4 | <0.001 | LM | 206572.4 | 25.37 |
| Model work | LM + s(t):work status | 1.4 | <0.001 | LM | 206575.2 | 25.08 |
| Model region | LM + s(t):region | 2.2 | <0.001 | LM | 206573.2 | 26.81 |
| Model depress | LM + s(t):depression status | 1.4 | <0.001 | LM | 206574.2 | 24.49 |
| Model time | LM + s(t) | 1.6 | <0.001 | LM | 206572.7 | 23.72 |
| Finding the full varying coefficient model | | | | | | |
| Model I | Model age + s(t):alcohol use | 8.9 | 0.008 | Model age | 206548.7 | 33.21 |
| Model II | Model I + s(t):physical activity | 16.8 | 0.070 | Model I | 206547.6 | 36.40 |
| Model III | Model II + s(t):income | 25.7 | 0.261 | Model II | 206549.0 | 38.45 |
| Model IV | Model II + s(t):marital status | 24.4 | 0.539 | Model II | 206550.8 | 38.72 |
| Model V | Model II + s(t):education | 36.5 | 0.227 | Model II | 206549.4 | 39.48 |
| Model VI | Model II + s(t):sex | 24.4 | 0.125 | Model II | 206547.3 | 37.34 |
| Model VII | Model II + s(t):work status | 21.4 | 0.550 | Model II | 206549.3 | 27.38 |
| Model VIII | Model II + s(t):region | 26.5 | 0.470 | Model II | 206550.0 | 38.32 |
| Model IX | Model II + s(t):depression status | 22.0 | 0.369 | Model II | 206548.9 | 37.42 |
| Model X | Model I + s(t) | 10.4 | 0.006 | Model I | 206548.7 | 33.21 |

Notes: s(t) - spline of time
The models were fitted with the `bam` function, which is designed for large datasets (see Endnote). The final model, Model X, can be written as:

| Variable | OR (95% C.I.) | p-value |
|---|---|---|
| Age (Reference: 60-69) | | |
| 18-29 | 2.08 (1.91-2.25) | <0.001 |
| 30-39 | 1.79 (1.64-1.95) | <0.001 |
| 40-49 | 1.75 (1.61-1.89) | <0.001 |
| 50-59 | 1.46 (1.34-1.60) | <0.001 |
| s(time):18-29 | - | 0.003 |
| s(time):30-39 | - | 0.649 |
| s(time):40-49 | - | 0.046 |
| s(time):50-59 | - | 0.619 |
| s(time):60-69 | - | 0.621 |
| Sex (Reference: Female) | | |
| Male | 1.61 (1.58-1.64) | <0.001 |
| Marital status (Reference: Married) | | |
| Single | 1.47 (1.44-1.51) | <0.001 |
| Widowed or divorced | 1.84 (1.78-1.90) | <0.001 |
| Education (Reference: University or higher) | | |
| High school | 1.36 (1.32-1.39) | <0.001 |
| Middle school | 1.81 (1.75-1.87) | <0.001 |
| Primary school or less | 1.44 (1.38-1.51) | <0.001 |
| Income (Reference: High) | | |
| Medium | 1.30 (1.27-1.32) | <0.001 |
| Low | 1.78 (1.73-1.83) | <0.001 |
| Work status (Reference: Works) | | |
| Does not work | 0.74 (0.73-0.76) | <0.001 |
| Region (Reference: North) | | |
| Central | 1.21 (1.19-1.24) | <0.001 |
| South | 1.10 (1.07-1.12) | <0.001 |
| Physical activity (Reference: Active) | | |
| Partially active | 0.95 (0.93-0.97) | <0.001 |
| Sedentary | 1.15 (1.12-1.18) | <0.001 |
| Alcohol consumption (Reference: High risk drinker) | | |
| Low risk drinker | 0.69 (0.64-0.75) | <0.001 |
| Non-drinker | 0.47 (0.43-0.52) | <0.001 |
| s(time):High risk drinker | - | 0.353 |
| s(time):Low risk drinker | - | 0.232 |
| s(time):Non-drinker | - | 0.038 |
| Depression (Reference: Not depressed) | | |
| Depressed | 1.43 (1.39-1.48) | <0.001 |
Discussion
Conclusion
Endnote
The models are estimated with the `mgcv` package in R [42,43], using the `gam` function for fitting generalized additive models. However, since the data in the presented application are relatively large, the `bam` function of the package is used to save computation time; it works like `gam` but is designed for large datasets [43]. This is especially useful in surveillance data analysis, where longer observation periods imply very large and growing sample sizes. Depending on the sample size, `bam` can fit the most complicated model in minutes, where `gam` may take several hours or even days. To make the function even faster, the smoothing parameter λ is selected by the fast REML computation method instead of the generalized cross-validation method usually used by `gam`. To estimate a varying coefficient model with this function, the `by` option is used, as in the following example:

`bam(SMK ~ Zj + INC + s(time, bs = "ps", k = 55, m = c(3, 2), by = INC), family = binomial("logit"))`,

where `SMK` is the response variable for smoking status, `INC` is the independent variable for income status, `Zj` are all the other independent variables with constant coefficients, `"ps"` requests P-spline estimation, `k = 55` is the number of knots, and `m = c(3, 2)` indicates the use of third-degree B-spline bases with a second-order difference penalty.

In the `plot.gam` function of the `mgcv` package, Bayesian confidence intervals are used for plotting the smooth terms; these can be obtained by simulating from the posterior distribution of the functional (varying) coefficients [39]. For model selection, testing between nested models was performed using `anova(model1, model2, test = "Chisq")` [39]. In addition, the AIC of each model was obtained with the `AIC` function in R.
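The workflow described in this endnote can be sketched end to end on simulated data. This is only an illustration under stated assumptions, not the analysis code used for the paper: the data frame `dat`, the single constant-coefficient covariate `Z`, the simulated effect sizes, the sample size, and the reduced basis dimension `k = 10` are all hypothetical choices made so the example runs in seconds. The baseline is fitted here with `gam`, and `method = "fREML"` is passed to `bam` explicitly to make the fast REML choice visible.

```r
library(mgcv)

set.seed(1)
n <- 5000  # hypothetical sample size, far smaller than the survey data

# Simulated stand-ins for the survey variables (all values are made up)
dat <- data.frame(
  time = runif(n, 0, 60),                       # observation time
  INC  = factor(sample(c("High", "Medium", "Low"), n, replace = TRUE),
                levels = c("High", "Medium", "Low")),
  Z    = rbinom(n, 1, 0.5)                      # one constant-coefficient covariate
)

# Assumed truth: only the "Low" income effect drifts over time
eta <- -1 + 0.4 * dat$Z + 0.3 * (dat$INC == "Low") +
  0.5 * sin(dat$time / 10) * (dat$INC == "Low")
dat$SMK <- rbinom(n, 1, plogis(eta))

# Baseline logistic model with constant coefficients only
m0 <- gam(SMK ~ Z + INC, family = binomial("logit"),
          data = dat, method = "REML")

# Varying coefficient model: one P-spline of time per income level via `by`;
# m = c(3, 2) is copied from the call in the text, k is reduced for speed
m1 <- bam(SMK ~ Z + INC + s(time, bs = "ps", k = 10, m = c(3, 2), by = INC),
          family = binomial("logit"), data = dat, method = "fREML")

# Nested-model test and AIC comparison, as described in the text
print(anova(m0, m1, test = "Chisq"))
print(AIC(m0, m1))

# Estimated time-varying effects with shaded confidence bands
plot(m1, pages = 1, shade = TRUE)
```

Because `INC` is a factor, the `by` argument produces one smooth of time per income level, which is why `INC` also enters the formula as a main effect.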