B. Shrinkage methods (ridge, LASSO, and elastic net)
Recently, variable selection methods such as the least absolute shrinkage and selection operator (LASSO) and the elastic net have gained popularity in high-dimensional statistical problems [20, 21]. We used the glmnet function from the glmnet package in R, which implements the elastic net, of which the LASSO is a special case [22]. The glmnet function can flexibly fit ridge regression (α = 0), the LASSO (α = 1), and anything in between (the elastic net, 0 < α < 1). In other words, the elastic net solves
$$ \underset{\beta_0,\beta }{\min}\left(-\left[\frac{1}{n}\sum \limits_{i=1}^n{y}_i\left({\beta}_0+{x}_i^T\beta \right)-\log \left(1+{e}^{\left({\beta}_0+{x}_i^T\beta \right)}\right)\right]+\lambda \left[\frac{\left(1-\alpha \right){\left\Vert \beta \right\Vert}_2^2}{2}+\alpha {\left\Vert \beta \right\Vert}_1\right]\right). $$
Hence, the α term controls how variables are selected based on the coefficients β.
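The objective above can be evaluated directly. The following Python sketch is illustrative only (the analysis itself uses glmnet in R); the function name and toy data are ours, not part of the paper:

```python
import numpy as np

def elastic_net_objective(beta0, beta, X, y, lam, alpha):
    """Penalized negative log-likelihood for logistic regression,
    matching the glmnet elastic-net objective: the ridge part is
    weighted by (1 - alpha)/2 and the lasso part by alpha."""
    eta = beta0 + X @ beta                               # linear predictor
    nll = -np.mean(y * eta - np.log1p(np.exp(eta)))      # negative log-likelihood
    penalty = lam * ((1 - alpha) * np.sum(beta**2) / 2
                     + alpha * np.sum(np.abs(beta)))
    return nll + penalty

# Hypothetical toy data: 5 observations, 2 predictors
X = np.array([[0.5, 1.0], [-1.0, 0.2], [0.3, -0.7], [1.5, 0.1], [-0.4, -1.2]])
y = np.array([1, 0, 1, 1, 0])

# alpha = 1 gives the pure LASSO penalty, alpha = 0 the pure ridge penalty
obj = elastic_net_objective(0.0, np.array([0.2, -0.1]), X, y, lam=0.5, alpha=1.0)
```

At β = 0 and β₀ = 0 the penalty vanishes and the objective reduces to the null negative log-likelihood, log 2, which is a quick sanity check on the implementation.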
For our paper, we focused on the LASSO (α = 1). The LASSO selects variables through the L1 penalty on the coefficients, ‖β‖1, which can set some coefficients exactly to zero (as opposed to ridge regression, which only smooths out, or shrinks, the coefficients without eliminating any [10]). Any α value smaller than one would also mix in ridge-type shrinkage, which we do not want here. With α = 1 fixed, it remains to choose the other tuning parameter,
λ, to complete the process of selecting variables via the LASSO. The glmnet documentation recommends that users inspect the entire solution path over all λ values, but in practice a single value must be chosen, so the companion cv.glmnet function reports two plausible λ values: lambda.min, the value of λ that gives the minimum mean cross-validated error, and lambda.1se, the largest value of λ such that the error is within one standard error of the minimum (the so-called "one-standard-error" rule) [22]. Hence, we consider two sets of selected variables, one based on lambda.min and one based on lambda.1se.
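The choice between lambda.min and lambda.1se can be sketched as follows, assuming a hypothetical grid of λ values with their cross-validated mean errors and standard errors (a Python illustration of the rule; glmnet's cross-validation routine performs this selection internally):

```python
import numpy as np

# Hypothetical cross-validation results: mean CV error and its
# standard error at each lambda on a decreasing grid.
lambdas  = np.array([1.0, 0.5, 0.25, 0.1, 0.05])
cv_error = np.array([0.40, 0.31, 0.28, 0.27, 0.29])
cv_se    = np.array([0.03, 0.03, 0.02, 0.02, 0.03])

i_min      = np.argmin(cv_error)
lambda_min = lambdas[i_min]                  # lambda minimizing mean CV error
threshold  = cv_error[i_min] + cv_se[i_min]  # one-standard-error band

# lambda.1se: largest lambda whose error stays within one SE of the minimum
lambda_1se = lambdas[cv_error <= threshold].max()
# Here lambda_1se = 0.25 > lambda_min = 0.1: a sparser, more conservative fit.
```

Because larger λ values penalize more heavily, lambda.1se always selects the same number of variables or fewer than lambda.min.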
Normally, one sets aside test and validation sets within the sample to perform classification. However, since we have a limited sample size, we employ k-fold cross-validation (CV) [10]. In k-fold CV, the sample is randomly divided into k roughly equal-sized sets, and one of the k sets is left out. A classifier is fit on the remaining k − 1 sets (the training set) and validated on the left-out set. We repeat this for each of the k sets; the k-fold CV classification rate is obtained by averaging the k individual classification rates. Because the k-fold partition is random, each repetition gives a different result, so we repeat the process M times (a Monte Carlo simulation of the k-fold CV). This yields a more reliable estimate of the k-fold CV classification rate as well as its distribution, including the mean and the standard deviation (standard error). We set M = 1,000 and report the Monte Carlo mean of the k-fold CV classification rates over the 1,000 repetitions. The computation times for the eight classifiers ranged from 2 to 3 s for all algorithms except AdaBoost (177 s), random forest (9 s), and KNN (< 1 s).
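The Monte Carlo repetition of the k-fold CV can be sketched as follows. This is an illustrative Python analogue, not the paper's R code: a nearest-centroid classifier on synthetic two-class data stands in for the eight classifiers, and a small M is used for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: two Gaussian classes, 40 observations, 2 features
n = 40
X = np.vstack([rng.normal(0, 1, (n // 2, 2)), rng.normal(2, 1, (n // 2, 2))])
y = np.repeat([0, 1], n // 2)

def nearest_centroid_rate(X_tr, y_tr, X_te, y_te):
    """Classification rate of a nearest-centroid classifier
    (a stand-in for any of the classifiers under study)."""
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    dists = np.linalg.norm(X_te[:, None, :] - centroids[None, :, :], axis=2)
    return np.mean(dists.argmin(axis=1) == y_te)

def kfold_cv_rate(X, y, k, rng):
    """One k-fold CV classification rate under a fresh random partition."""
    folds = np.array_split(rng.permutation(len(y)), k)
    rates = []
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        rates.append(nearest_centroid_rate(X[train_idx], y[train_idx],
                                           X[test_idx], y[test_idx]))
    return np.mean(rates)

# Monte Carlo simulation of the k-fold CV: repeat M times and summarize
M, k = 100, 5                      # the paper uses M = 1,000
rates = np.array([kfold_cv_rate(X, y, k, rng) for _ in range(M)])
mc_mean, mc_se = rates.mean(), rates.std()
```

Each repetition redraws the random partition, so the M rates form an empirical distribution whose mean and standard deviation are the quantities reported in the paper.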