01.12.2013 | Research article | Issue 1/2013 | Open Access

# Use of generalised additive models to categorise continuous variables in clinical prediction

- Journal: BMC Medical Research Methodology > Issue 1/2013

## Electronic supplementary material

## Competing interests

## Authors’ contributions

## Background

## Methods

### Theoretical methods

#### Generalised additive models

A generalised additive model (GAM) relates the expected value of the response Y to the covariates through

$g\left(E\left[Y\right]\right)={\beta}_{0}+{f}_{1}\left({X}_{1}\right)+\cdots +{f}_{p}\left({X}_{p}\right)$

where g is the link function and the ${f}_{i}$ are some smooth functions of the covariates ${X}_{i}$ for each i = 1,…,p.

The main difficulty lies in estimating the smooth functions ${f}_{i}$, and there are different ways to address this. One of the most common alternatives is based on splines, which allow the GAM estimation to be reduced to the GLM context [12]. Splines are piecewise polynomials that join at points called knots; two major families belong to these models, namely regression splines and smoothing splines. Regression splines require selecting the number and location of the knots and imposing restrictions so that the piecewise polynomials join smoothly. Smoothing splines [13] use as many knots as there are unique values of the covariate ${X}_{i}$ and control the model’s smoothness by adding a penalty to the least-squares fitting objective [14]. An intermediate alternative for building the smooth functions, one that combines the advantages of both smoothing and regression splines, is the use of penalised splines (also known as P-splines), introduced by Eilers and Marx [15]. P-splines use fewer knots than smoothing splines and introduce more general roughness penalties, which relax the importance of the knot location. We include a brief description of splines here; more detailed theoretical information is given in Appendix 1: Splines.
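As an illustration of the P-spline idea, the following sketch fits a penalised B-spline curve by minimising a least-squares criterion plus a difference penalty on adjacent coefficients. The Gaussian response, evenly spaced knots, and the specific smoothing parameter are illustrative assumptions, not settings taken from the paper:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(x, knots, degree=3):
    """Evaluate a clamped B-spline basis at the points x."""
    t = np.r_[[knots[0]] * degree, knots, [knots[-1]] * degree]
    n_basis = len(t) - degree - 1
    B = np.empty((len(x), n_basis))
    for j in range(n_basis):
        coef = np.zeros(n_basis)
        coef[j] = 1.0
        B[:, j] = BSpline(t, coef, degree)(x)
    return B

def pspline_fit(x, y, n_knots=20, degree=3, lam=1.0, diff_order=2):
    """P-spline fit: minimise ||y - B a||^2 + lam * ||D a||^2,
    with D the diff_order-th order difference matrix."""
    knots = np.linspace(x.min(), x.max(), n_knots)
    B = bspline_basis(x, knots, degree)
    D = np.diff(np.eye(B.shape[1]), n=diff_order, axis=0)
    alpha = np.linalg.solve(B.T @ B + lam * (D.T @ D), B.T @ y)
    return B @ alpha
```

Because the penalty, not the knots, controls smoothness, the number and placement of knots matter far less than in unpenalised regression splines.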

#### Categorisation methodology

In the fitted GAM, the point at which the smooth function equals zero, f(X) = 0, corresponds to the average risk of the covariate. We therefore propose to start by creating an average-risk category around this average-risk point, together with as many high- and low-risk categories as are required to capture the relationship between X and f(X), as outlined in detail below.

The first step is to find the point ${x}_{0}\in X$ such that $f\left({x}_{0}\right)=0$. To do so, we calculate the value of ${x}_{0}$ by computing the inverse of f, and then the estimated value $\widehat{{\mu}_{0}}$ associated with ${x}_{0}$.

If ${x}_{0}$ is not unique, i.e., if the displayed graph crosses zero more than once, then there will be more than one average-risk category, provided that the band at each ${x}_{0}$ is not too wide (the band being the confidence interval shown in the graph). In other words, if ${x}_{01}$ and ${x}_{02}$ are two values at which the graph crosses zero, two average-risk categories will be considered, as long as $\left({x}_{0{1}_{\mathit{\text{inf}}}},{x}_{0{1}_{\mathit{\text{sup}}}}\right)$ and $\left({x}_{0{2}_{\mathit{\text{inf}}}},{x}_{0{2}_{\mathit{\text{sup}}}}\right)$ do not overlap. If they do overlap, we hypothesise that this may be due to one of two situations. The first is that one of the two intervals is based on a very small sample size, which leads to an inaccurate and hence very wide interval. The second is the overlapping of two intervals of similar size. In the first circumstance, we suggest dismissing the interval based on the very small sample; in the second, we consider the union of both intervals as the average-risk category.
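Numerically, locating ${x}_{0}$ and the average-risk category amounts to finding where the fitted smooth crosses zero and where the pointwise confidence band contains zero. A minimal sketch on a grid of covariate values (the grid, the function, and the band below are illustrative, not output from the paper's GAM):

```python
import numpy as np

def zero_crossings(x, f):
    """Linearly interpolated roots of f along the grid x."""
    s = np.sign(f)
    idx = np.where(s[:-1] * s[1:] < 0)[0]
    return x[idx] - f[idx] * (x[idx + 1] - x[idx]) / (f[idx + 1] - f[idx])

def average_risk_category(x, lower, upper):
    """Interval where the pointwise confidence band contains f = 0."""
    inside = (lower <= 0) & (upper >= 0)
    if not inside.any():
        return None
    return x[inside].min(), x[inside].max()
```

With several disjoint regions where the band contains zero, `average_risk_category` would be applied per region, mirroring the multiple-category case discussed above.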

### Implementation methods

#### Application to the IRYSS-COPD study database

#### Validation

## Results

### Categorisation process

For the RR, the estimated smooth function satisfied $f\left({x}_{0}\right)=0$ at ${x}_{0}=22$. Application of the proposed methodology to determine the limits of the average-risk category showed this category to be (20-24]. It was therefore decided that the RR would be classified into 3 categories (Figure 2), with a high risk of poor evolution for values above 24 and a low risk of poor evolution for those below 20; accordingly, our final proposal for classifying the RR into three categories was ≤ 20; (20-24]; > 24.
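Once the cut points are fixed, the three-category classification ≤ 20; (20-24]; > 24 can be applied with right-closed binning; a small sketch (the example RR values are illustrative):

```python
import numpy as np

rr = np.array([18, 20, 22, 24, 30, 31])  # illustrative respiratory-rate values
# right=True makes the bins right-closed:
#   0 -> RR <= 20, 1 -> 20 < RR <= 24, 2 -> RR > 24
categories = np.digitize(rr, bins=[20, 24], right=True)
# categories -> array([0, 0, 1, 1, 2, 2])
```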

### Validation

| Variable | Cut points | AIC | AUC | p-value* |
|---|---|---|---|---|
| RR | Continuous‡ | 317.10 | 0.634 | - |
| RR | Dichotomised: ≤ 22; > 22 | 318.10 | 0.594 | 0.079 |
| RR | 3-category: ≤ 20; (20-24]; > 24 | 314.50 | 0.638 | 0.8198 |
| RR | 4-category: ≤ 20; (20-24]; (24-30]; > 30 | 316.20 | 0.640 | 0.6833 |
| PCO2 | Continuous‡ | 250.26 | 0.825 | - |
| PCO2 | Dichotomised: ≤ 47; > 47 | 281.50 | 0.742 | ≤ 0.0001 |
| PCO2 | 3-category: ≤ 43; (43-52]; > 52 | 270.76 | 0.779 | 0.0002 |
| PCO2 | 4-category: ≤ 43; (43-52]; (52-65]; > 65 | 258.11 | 0.810 | 0.1148 |

AIC computed in the derivation sample (N = 805); AUC and p-value* computed in the validation sample (N = 545).

| Category | Estimate | 95% CI | p-value |
|---|---|---|---|
| RR ≤ 20 | -1.76 | (-2.37, -1.16) | < 0.0001 |
| RR (20-24] | -1.22 | (-1.86, -0.58) | 0.0002 |
| RR (24-30] | -0.56 | (-1.18, 0.06) | 0.074 |
| RR > 30 | - | - | - |
| PCO2 ≤ 43 | -3.48 | (-4.18, -2.86) | < 0.0001 |
| PCO2 (43-52] | -2.62 | (-3.27, -2.03) | < 0.0001 |
| PCO2 (52-65] | -1.44 | (-1.97, -0.93) | < 0.0001 |
| PCO2 > 65 | - | - | - |

Hosmer-Lemeshow test p-value > 0.05 for both the RR and the PCO2 models.
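The AUC values reported in the validation column measure discrimination. A model-agnostic way to compute an AUC from predicted risks and observed binary outcomes is the Mann-Whitney formulation (a generic sketch, not the code used in the study):

```python
import numpy as np

def auc_mann_whitney(scores, outcomes):
    """AUC = P(score of a random case > score of a random non-case),
    counting ties as 1/2; equivalent to the Mann-Whitney U statistic."""
    scores = np.asarray(scores, dtype=float)
    outcomes = np.asarray(outcomes, dtype=bool)
    pos = scores[outcomes]
    neg = scores[~outcomes]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))
```

An AUC of 0.5 indicates no discrimination and 1.0 perfect discrimination, which is the scale on which the continuous, dichotomised, and multi-category models are compared above.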

## Discussion

## Appendix 1: Splines

A GAM takes the form $g\left(E\left[Y\right]\right)={\beta}_{0}+{f}_{1}\left({X}_{1}\right)+\cdots +{f}_{p}\left({X}_{p}\right)$, where g is the link function and the ${f}_{i}$ are some smooth functions of the covariates ${X}_{i}$ for each i = 1,…,p.

One of the most common approaches to building the smooth functions ${f}_{i}$ is based on splines. Splines are piecewise polynomials whose pieces are defined by a sequence of knots ${\zeta}_{1}<{\zeta}_{2}<\cdots <{\zeta}_{m}$, in such a way that the pieces join smoothly at these knots. A spline of degree r can be defined by a power series as follows:

$s\left(x\right)=\sum _{j=0}^{r}{\beta}_{j}{x}^{j}+\sum _{k=1}^{m}{b}_{k}{\left(x-{\zeta}_{k}\right)}_{+}^{r},$

where ${\left(u\right)}_{+}=\mathrm{max}\left(u,0\right)$ denotes the truncated power function.

In the B-spline representation, each smooth function can be written as ${f}_{i}={B}_{i}{\alpha}_{i}$, where, for each i = 1,…,p, ${B}_{i}$ is the B-spline matrix of dimension N × m (N the number of observations and m the number of knots), and ${\alpha}_{i}={\left({\alpha}_{1i},\dots ,{\alpha}_{mi}\right)}^{T}$ is the m-dimensional coefficient vector associated with the B-spline basis, also called the B-spline amplitudes. As an example, in the particular case of a single covariate and a normally distributed response variable, the model reduces to the linear expression $y=B\alpha +\epsilon$.

Here ${B}_{jk}={B}_{k}\left({x}_{j}\right)$ is the value of the k-th B-spline at the point ${x}_{j}$. The smoothness of the curve depends on the number of B-splines, and hence on the number of knots, and on the value of the amplitudes, i.e., the vector α. The advantage of this option is that B-splines are easy to build, but the main problem now resides in optimising the position and the number of knots.

Here ${m}_{i}$ is the number of P-splines, that is, the number of knots, for the i-th covariate, for each i = 1,…,p, and B is the $N\times (1+\sum _{i=1}^{p}{m}_{i})$ regressor matrix defined as $B=\left[{1}_{N}\mid {B}_{1}\mid \cdots \mid {B}_{p}\right]$.

where ${\lambda}_{i}\ge 0$ are the smoothing parameters and ${P}_{i}$ is an ${m}_{i}\times {m}_{i}$ matrix that defines the penalty structure over the d-order differences between every two adjacent P-spline coefficients. The estimation method is explained in detail in [23].
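The penalty matrix built from d-order differences of adjacent coefficients can be constructed directly. A small sketch, assuming second-order differences (d = 2) and m = 6 coefficients for illustration:

```python
import numpy as np

m, d = 6, 2
D = np.diff(np.eye(m), n=d, axis=0)   # (m - d) x m difference operator
P = D.T @ D                           # m x m penalty matrix, rank m - d
alpha = np.arange(m, dtype=float)     # a linear coefficient sequence
# second differences of a linear sequence vanish, so D @ alpha is zero:
# linear trends in the coefficients are left unpenalised
```

This is why a second-order difference penalty shrinks the fit toward a straight line rather than toward a constant as the smoothing parameter grows.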