Background
Application: the Box Lunch Study
Software availability
Methods
A brief introduction to decision trees
Adjusting for covariates
Visualizing subgroups in decision trees
- Group 1 (\(N = 86\)): Moderate to low liking, all but very high hunger. This group has below-average energy intake (standardized mean = −0.46).
- Group 2 (\(N = 22\)): Moderate to high liking, very low relative reinforcing value of food, all but very high hunger. This group has moderate to low energy intake.
- Group 3 (\(N = 104\)): Moderate to high liking, all but very low relative reinforcing value of food, all but very high hunger. This group has moderate to high energy intake.
- Group 4 (\(N = 14\)): Very high hunger. This group has very high energy intake.
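The four subgroups above can be read as a sequence of yes/no questions. The following is a minimal sketch of that reading, not the fitted tree itself: the function name, the boolean inputs, and the order in which the questions are asked are assumptions made for illustration, and the actual cut points on hunger, liking, and relative reinforcing value are not reproduced here.

```python
def assign_group(very_high_hunger: bool,
                 liking_moderate_to_high: bool,
                 very_low_rrv: bool) -> int:
    """Map three yes/no answers to the subgroup labels 1-4 listed above.

    Hypothetical helper: each boolean stands for which side of a split a
    participant falls on; the numeric thresholds are deliberately omitted.
    """
    if very_high_hunger:
        return 4  # very high hunger, regardless of liking or RRV
    if not liking_moderate_to_high:
        return 1  # moderate to low liking
    if very_low_rrv:
        return 2  # higher liking, very low relative reinforcing value
    return 3      # higher liking, all but very low relative reinforcing value


# Example: a participant without very high hunger, with higher liking
# and a very low relative reinforcing value of food.
group = assign_group(False, True, True)
```

Each participant falls into exactly one terminal node, so the four group sizes add up to the analyzed sample (86 + 22 + 104 + 14 = 226).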
Methods for building decision trees
Classification and regression trees (CART)
Conditional inference trees (CTree)
Stopping rules
Comparing CART and CTree: a simulation study
CART was implemented via the rpart package [13], while the CTree was implemented via the partykit package [12]. We considered a variety of scenarios in which we varied the data-generating function, the covariate type (categorical vs. continuous), the sparsity (proportion of variables predicting the outcome), the total sample size, and the complexity parameter for CART. [...] DMwR package [14]. The tree-generating functions rpart (for CART) and ctree (for CTree) were applied with arguments specifying a minimum of 20 observations for a node to be considered for splitting and a minimum of 7 observations in a terminal node. The complexity parameter for CART was held at the default value of 0.01. The level of significance in the CTree was held at the default value of \(\alpha = 0.05\).
Effect of the data generating process
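The node-size and complexity settings used in the simulation setup (at least 20 observations to attempt a split, at least 7 observations per terminal node, complexity parameter 0.01) can be illustrated with a small sketch. This is a simplified Python stand-in, not the rpart implementation: the function name and arguments are hypothetical, and the improvement is measured here as the reduction in the sum of squared errors relative to the root node.

```python
def split_allowed(n_parent, n_left, n_right,
                  sse_parent, sse_children, sse_root,
                  min_split=20, min_bucket=7, cp=0.01):
    """Return True if a candidate split passes three simplified stopping rules.

    Hypothetical helper mirroring the settings described in the text:
    min_split  - minimum observations in a node before a split is attempted
    min_bucket - minimum observations in each resulting terminal node
    cp         - minimum relative improvement in fit (assumes sse_root > 0)
    """
    if n_parent < min_split:
        return False  # node too small to be considered for splitting
    if min(n_left, n_right) < min_bucket:
        return False  # a child node would be too small
    relative_improvement = (sse_parent - sse_children) / sse_root
    return relative_improvement >= cp


# Example: a node of 40 observations split evenly, reducing the error
# from 10.0 to 8.0 in a tree whose root-node error is 10.0.
ok = split_allowed(40, 20, 20, sse_parent=10.0, sse_children=8.0, sse_root=10.0)
```

The CTree stopping rule is different in kind: it compares a p value against \(\alpha\) rather than an improvement in fit against a fixed fraction, so it is not covered by this sketch.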
Type I error
Effect of sample size
Results
Comparing CART and CTree: a simulation study
Effect of the data generating process
True model | Type | MSE (mean) | MSE (SD) | Terminal nodes (mean) | Terminal nodes (20th percentile) | Terminal nodes (80th percentile) |
---|---|---|---|---|---|---|
Tree | CART | 1.26 | 0.151 | 7.01 | 6 | 8 |
Tree | Pruned CART | 1.22 | 0.137 | 4.27 | 3 | 5 |
Tree | Pruned CART (1-SE) | 1.25 | 0.139 | 3.31 | 3 | 4 |
Tree | CTree | 1.27 | 0.154 | 3.72 | 3 | 4 |
Tree | Linear regression | 2.04 | 0.179 | | | |
Regression | CART | 4.12 | 0.413 | 15.24 | 14 | 16 |
Regression | Pruned CART | 4.19 | 0.442 | 13.97 | 12 | 16 |
Regression | Pruned CART (1-SE) | 4.55 | 0.509 | 8.66 | 6 | 11 |
Regression | CTree | 4.14 | 0.409 | 13.96 | 13 | 15 |
Regression | Linear regression | 1.03 | 0.093 | | | |
Hybrid | CART | 1.39 | 0.138 | 13.1 | 11 | 15 |
Hybrid | Pruned CART | 1.37 | 0.131 | 5.96 | 3 | 9 |
Hybrid | Pruned CART (1-SE) | 1.39 | 0.133 | 2.69 | 2 | 3 |
Hybrid | CTree | 1.34 | 0.126 | 5.42 | 4 | 6 |
Hybrid | Linear regression | 1.17 | 0.106 | | | |
Type I error
Type | MSE (mean) | MSE (SD) | Type I error (mean) |
---|---|---|---|
CART | 0.65 | 0.07 | 1 |
Pruned CART | 0.99 | 0.091 | 0.0559 |
Pruned CART (1-SE) | 1 | 0.089 | 0.0003 |
CTree | 0.99 | 0.089 | 0.0513 |
Linear regression | 0.97 | 0.088 | |
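One way to compute a type I error rate like the one tabulated above is as the share of simulated null data sets in which the fitted tree makes at least one split. The sketch below assumes that definition, which is not confirmed by the text; the function name and the input format are hypothetical.

```python
def type_one_error_rate(terminal_node_counts):
    """Fraction of fitted trees that split at least once.

    Assumed definition: under a null scenario with no covariate effects,
    any tree with more than one terminal node counts as a false positive.
    `terminal_node_counts` holds one terminal-node count per simulated data set.
    """
    if not terminal_node_counts:
        raise ValueError("need at least one simulated tree")
    false_positives = sum(1 for k in terminal_node_counts if k > 1)
    return false_positives / len(terminal_node_counts)


# Example: four simulated trees, one of which split once.
rate = type_one_error_rate([1, 1, 2, 1])
```

Under this reading, a value of 1 for unpruned CART means that every simulated tree split at least once, while values near 0.05 for pruned CART and the CTree mean that roughly one tree in twenty did.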
Effect of sample size
In the rpart package, the default complexity parameter value is 0.01, so splitting stops if no split improves model fit by at least 1%. In this setting, the covariates have continuous linear effects, which implies an infinite number of population subgroups. Hence, most splits yield only small improvements in model fit, and the CART variants "stop too soon" and predict poorly. In contrast, the stopping criterion for the CTree is based on p values, and keeping the p value threshold fixed as the sample size increases allows splits associated with smaller and smaller effect sizes to be represented in the tree.
Application
Variable | Estimate | SE | t value | Pr(\({>}\lvert \hbox{t}\rvert\)) |
---|---|---|---|---|
(Intercept) | 1279.36 | 211.78 | 6.04 | <0.001*** |
Sex: male | 378.03 | 66.30 | 5.70 | <0.001*** |
Body mass index | 16.68 | 6.96 | 2.40 | 0.017* |
Snack-energy kcal/day | 1.29 | 0.12 | 10.76 | <0.001*** |
Fruit/vegetable svg/day | 38.84 | 14.94 | 2.60 | 0.010** |
Sugar-sweetened beverage svg/day | 114.20 | 30.3234 | 3.77 | <0.001*** |
Contour drawing rating scale-body dissatisfaction [1–9] | −48.44 | 26.2195 | −1.85 | 0.066 |
Frequency of self-weigh | ||||
Never | (Ref) | |||
About once a year or less | −405.34 | 145.47 | −2.79 | 0.006** |
Every couple of months | −247.32 | 137.55 | −1.80 | 0.074 |
Every month | −374.43 | 147.96 | −2.53 | 0.012* |
Every week | −414.77 | 138.67 | −2.99 | 0.003** |
Every day | −450.17 | 166.89 | −2.70 | 0.008** |
Fast food frequency | ||||
Never | (Ref) | |||
1–3 times last month | 14.13 | 77.01 | 0.18 | 0.855 |
1–2 times per week | 35.63 | 95.42 | 0.37 | 0.709 |
3–4 times per week | −187.55 | 204.63 | −0.92 | 0.360 |
5–6 times per week | −235.81 | 237.61 | −0.99 | 0.322 |
7 or more times per week | 738.04 | 238.35 | 3.10 | 0.002** |
Hunger | 32.52 | 10.15 | 3.20 | 0.002** |
Wanting | 2.88 | 0.85 | 3.40 | <0.001*** |