We used two feature selection methods, LASSO [11] and Elastic Net [12]. Both are regularization methods that automatically select features from the dataset and hence provide a sparse solution. Regularization works as follows. Starting from a simple linear regression model, we consider \(x_{1} \ldots x_{p}\) as the \(p\) predictor variables (features) and \(y\) as the outcome or response variable:
$$\hat{y} = \hat{\beta }_{0} + \hat{\beta }_{1} x_{1} + \hat{\beta }_{2} x_{2} + \cdots + \hat{\beta }_{p} x_{p}$$
(1)
Here, model fitting produces the vector of estimated regression coefficients through ordinary least squares (OLS), whose objective is to minimize the residual sum of squares (RSS, Eq. 2). The values minimizing this function are the estimated regression coefficients \(\beta\).
$$RSS\left( \beta \right) = \mathop \sum \limits_{i = 1}^{N} \left( {y_{i} - x_{i}^{T} \beta } \right)^{2}$$
(2)
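For illustration, a minimal Python sketch of Eqs. (1) and (2), using numpy on a small synthetic dataset (the data, dimensions, and variable names are assumptions for demonstration only, not our actual study data):
```python
import numpy as np

# Small synthetic example: N samples, p predictors (illustrative only)
rng = np.random.default_rng(0)
N, p = 100, 5
X = rng.normal(size=(N, p))
true_beta = np.array([1.5, 0.0, -2.0, 0.0, 0.5])
y = X @ true_beta + rng.normal(scale=0.1, size=N)

# Add an intercept column and solve the OLS problem (Eq. 1)
X1 = np.column_stack([np.ones(N), X])
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Residual sum of squares of the fitted model (Eq. 2)
rss = np.sum((y - X1 @ beta_hat) ** 2)
print(beta_hat, rss)
```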
In regularization methods, an extra penalty term is added (Eq. 3), so the new objective function to minimize becomes:
$$RSS\left( \beta \right) + \lambda \, p\left( \beta \right)$$
(3)
Here \(p\) is the penalty function and \(\lambda\) is the penalty (regularization) parameter. The parameter \(\lambda\) controls the trade-off between the likelihood and the penalty and so influences which variables are selected: the higher the value of \(\lambda\), the fewer variables are selected, and vice versa. The differences between regularization methods lie in the penalty function \(p\) they apply. In LASSO, the penalty is applied to the sum of the absolute values of the regression coefficients (the L1 norm). Mathematically, we can write this as:
$$\mathop {\min }\limits_{\beta \in R^{p} } \frac{1}{2}\left\| {y - X\beta } \right\|_{2}^{2} + \lambda \left\| \beta \right\|_{1}$$
(4)
The left part of the equation is the ordinary least squares criterion, whereas the right part penalizes the sum of the absolute values of the regression coefficients.
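As a brief sketch of what such a LASSO-based selection could look like in practice, the following Python snippet uses scikit-learn's Lasso estimator on synthetic data; the regularization strength and all names are illustrative assumptions and do not reflect the settings of our pipeline:
```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: only a few of the p features truly influence y (illustrative only)
rng = np.random.default_rng(0)
N, p = 100, 20
X = rng.normal(size=(N, p))
true_beta = np.zeros(p)
true_beta[[0, 3, 7]] = [2.0, -1.5, 1.0]
y = X @ true_beta + rng.normal(scale=0.5, size=N)

# `alpha` plays the role of lambda in Eq. (4)
# (scikit-learn divides the squared-error term by n, so the scale differs)
lasso = Lasso(alpha=0.1).fit(X, y)

# The L1 penalty drives most coefficients exactly to zero,
# so the non-zero entries are the selected features
selected = np.flatnonzero(lasso.coef_)
print("selected features:", selected)
```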
In Ridge regression [13], the precursor of LASSO, the penalty \(p\) is the L2 norm of the coefficients (the sum of their squares). In this case, selection is not sparse, since the coefficients are shrunk towards zero but never exactly zero; instead, a ranking of features based on the penalized regression coefficients is produced. Elastic Net (EN) [12], on the other hand, merges the LASSO and Ridge penalties (Eq. 5). It allows a sparse representation, similarly to LASSO, and theoretically improves on its performance in \(p \gg n\) cases with highly collinear groups of features by allowing grouped selection or deselection of correlated variables, whereas LASSO tends to select only one, essentially arbitrary, variable from a group of pairwise-correlated features. An equivalent representation of the same problem is shown below (Eq. 6), with a single parameter \(\alpha\) regulating the balance between the LASSO and Ridge penalties. When \(\alpha\) is equal or close to 0, the L1 term dominates and the solution is close or equal to LASSO, whereas when \(\alpha\) is equal or close to 1, the behaviour resembles Ridge.
$$\mathop {\min }\limits_{\beta \in R^{p} } \frac{1}{2}\left\| {y - X\beta } \right\|_{2}^{2} + \lambda_{1} \left\| \beta \right\|_{1} + \lambda_{2} \left\| \beta \right\|_{2}^{2}$$
(5)
$$\mathop {\min }\limits_{\beta \in R^{p} } \frac{1}{2}\left\| {y - X\beta } \right\|_{2}^{2} \quad {\text{subject to}}\quad \left( {1 - \alpha } \right)\left\| \beta \right\|_{1} + \alpha \left\| \beta \right\|_{2}^{2} \le t$$
(6)
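To illustrate the grouped-selection behaviour of EN versus LASSO, a minimal Python sketch using scikit-learn's ElasticNet on synthetic correlated features (all settings are illustrative assumptions; note also that scikit-learn's l1_ratio convention is the reverse of \(\alpha\) in Eq. (6): l1_ratio = 1 corresponds to the pure L1/LASSO penalty):
```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Two strongly correlated predictors plus unrelated noise features (illustrative only)
rng = np.random.default_rng(0)
N = 100
z = rng.normal(size=N)
X = np.column_stack([
    z + 0.01 * rng.normal(size=N),   # feature 0
    z + 0.01 * rng.normal(size=N),   # feature 1 (highly correlated with 0)
    rng.normal(size=(N, 8)),         # unrelated features
])
y = 2.0 * z + rng.normal(scale=0.5, size=N)

# l1_ratio=1 gives the pure L1 (LASSO) penalty, l1_ratio=0 the pure L2 (Ridge) penalty
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
lasso_like = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X, y)

# EN tends to keep both correlated features with similar weights,
# whereas the LASSO-like fit tends to keep only one of them
print("Elastic Net coefs for features 0,1:", enet.coef_[:2])
print("LASSO-like coefs for features 0,1:", lasso_like.coef_[:2])
```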
This combination of the LASSO and EN methods comprises the backbone of our pipeline, and its construction is described below.