Synthesis analysis and NEW-STROKE model
Synthesis analysis is a statistical method for developing a multivariate regression model by integrating incomplete regression models with the correlations among the independent variables of interest. In our study, we used synthesis analysis to propose the NEW-STROKE model, which includes seven more risk factors than the Framingham Stroke Risk Score (FSRS) model for predicting stroke risk in a future patient. Specifically, the FSRS model is one of the most well-regarded stroke risk appraisal tools and can serve as the baseline equation in developing the NEW-STROKE model. The seven additional risk factors are African American ethnicity, physical exercise level, body mass index, waist circumference, height, HDL cholesterol, and the use of hormone replacement therapy; their associations with stroke were derived from different longitudinal studies and comprehensive meta-analyses of the medical literature. Let \( Z_1, Z_2, \cdots, Z_7 \) denote these seven covariates, respectively.
Developing the NEW-STROKE model by synthesis analysis rests on several assumptions. The key assumption is that the input information, namely the association of each risk factor with the stroke score and the correlations among the multiple risk factors, is representative of the population. The second is that the seven risk factors added to the new model must be available in the training set used to construct the model. In our study, the third National Health and Nutrition Examination Survey (NHANES III) was used to build the NEW-STROKE model for predicting the stroke risk score, because the NHANES III data are a representative sample of the US population and can be regarded as the only “superpopulation”. These variables are all available in both the NHANES III and ARIC data. The third assumption is that the correlations between these risk factors and the stroke outcome are exchangeable across different studies. The detailed procedure for constructing NEW-STROKE is as follows.
The dependent variable is the logit of the probability of the stroke outcome Y. The Framingham Stroke Risk Score model can be regarded as the gold standard for stroke risk prediction and is used as the baseline equation in developing the new model. The following steps describe the synthesis analysis methodology in detail.
Assume the baseline empirical logistic model is:
$$ LY=\alpha +{\beta}_1{X}_1+{\beta}_2{X}_2+\cdots +{\beta}_9{X}_9 $$
where LY denotes the logit of Y.
The parameters \( \alpha, \beta_1, \beta_2, \cdots, \beta_9 \) were estimated in the Framingham Heart Study. We use \( \widehat{\alpha}+\widehat{\beta_1}{X}_1+\widehat{\beta_2}{X}_2+\cdots +\widehat{\beta_9}{X}_9 \) to predict LY for each patient who was followed up in NHANES III, and denote the prediction by \( \widehat{LY} \). The correlations between the seven additional risk factors (\( Z_1, Z_2, \cdots, Z_7 \)) and the stroke outcome are represented by \( \gamma_1, \gamma_2, \cdots, \gamma_7 \), which were derived from the medical literature.
Firstly, we use \( \widehat{\alpha}+\widehat{\beta_1}{X}_1+\widehat{\beta_2}{X}_2+\cdots +\widehat{\beta_9}{X}_9 \) to calculate the predicted dependent variable \( \widehat{LY} \) for each patient in the NHANES III.
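As a minimal sketch of this first step, the baseline linear predictor can be computed for all patients at once. The function name `baseline_logit`, the array layout (one row per patient, one column per baseline covariate), and the numeric values are assumptions made for illustration; they are not the Framingham estimates.

```python
import numpy as np

def baseline_logit(X, alpha_hat, beta_hat):
    """Return LY_hat = alpha_hat + sum_k beta_hat[k] * X[:, k] for each row of X."""
    X = np.asarray(X, dtype=float)
    beta_hat = np.asarray(beta_hat, dtype=float)
    return alpha_hat + X @ beta_hat

# Hypothetical inputs (two patients, two covariates), for illustration only.
X_demo = np.array([[1.0, 0.0],
                   [2.0, 1.0]])
ly_hat_demo = baseline_logit(X_demo, alpha_hat=-3.0, beta_hat=[0.5, 1.0])
```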
Secondly, for the same dataset, we take \( \widehat{LY} \) as the dependent variable and the first additional risk factor \( Z_1 \) as the independent variable, and build the linear regression model \( \widehat{LY}={\delta}_1+{\varsigma}_1{Z}_1 \). We use a weighted least squares method to obtain the regression coefficient \( \widehat{\varsigma_1} \), which represents the association of \( \widehat{\alpha}+\widehat{\beta_1}{X}_1+\widehat{\beta_2}{X}_2+\cdots +\widehat{\beta_9}{X}_9 \) with \( Z_1 \), that is, how much of \( Z_1 \) is already captured in the baseline equation. \( \gamma_1 \) denotes the correlation between the first risk factor \( Z_1 \) and the stroke outcome.
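The weighted least squares fit in this second step might be sketched as follows. The text does not state which weights are used; the sketch assumes one non-negative weight per patient (for example, survey sampling weights), and the name `wls_slope` and the demo data are hypothetical.

```python
import numpy as np

def wls_slope(z, ly_hat, weights):
    """Fit ly_hat = delta + varsigma * z by weighted least squares.

    Returns (delta_hat, varsigma_hat). The weights are an assumption:
    one non-negative weight per patient.
    """
    z = np.asarray(z, dtype=float)
    ly_hat = np.asarray(ly_hat, dtype=float)
    w = np.asarray(weights, dtype=float)
    z_bar = np.average(z, weights=w)        # weighted mean of Z
    ly_bar = np.average(ly_hat, weights=w)  # weighted mean of LY_hat
    varsigma_hat = (np.sum(w * (z - z_bar) * (ly_hat - ly_bar))
                    / np.sum(w * (z - z_bar) ** 2))
    delta_hat = ly_bar - varsigma_hat * z_bar
    return delta_hat, varsigma_hat

# Hypothetical data lying exactly on the line LY_hat = 1 + 2 * Z.
z_demo = np.array([0.0, 1.0, 2.0, 3.0])
delta_demo, varsigma_demo = wls_slope(z_demo, 1.0 + 2.0 * z_demo,
                                      weights=[1.0, 2.0, 1.0, 3.0])
```

Because the demo points lie exactly on a line, any choice of positive weights recovers the same intercept and slope; with real data the weights change the estimate.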
Thirdly, we use the difference between \( \gamma_1 \) and \( \widehat{\varsigma_1} \) to reflect the association of the first risk factor \( Z_1 \) with stroke that is not captured in the baseline equation. The new equation has the form:
$$ LY1=\widehat{LY}+\left({\gamma}_1-\widehat{\varsigma_1}\right)\left({Z}_1-\overline{Z_1}\right)=\widehat{\alpha}+\widehat{\beta_1}{X}_1+\widehat{\beta_2}{X}_2+\cdots +\widehat{\beta_9}{X}_9+\left({\gamma}_1-\widehat{\varsigma_1}\right)\left({Z}_1-\overline{Z_1}\right) $$
Here, \( \overline{Z_1} \) denotes the mean of \( Z_1 \). The constant \( \widehat{\alpha} \) stays the same when \( Z_1 \) is mean-centered. This new equation is then regarded as the new baseline equation, and we add the second risk factor of interest to it. We repeat the first through third steps until all risk factors of interest are included in the final model.
The final equation is:
$$ \begin{array}{l} LY7=\widehat{\alpha}+\widehat{\beta_1}{X}_1+\widehat{\beta_2}{X}_2+\cdots +\widehat{\beta_9}{X}_9+\left({\gamma}_1-\widehat{\varsigma_1}\right)\left({Z}_1-\overline{Z_1}\right)\\ {}\kern2em +\left({\gamma}_2-\widehat{\varsigma_2}\right)\left({Z}_2-\overline{Z_2}\right)+\cdots +\left({\gamma}_7-\widehat{\varsigma_7}\right)\left({Z}_7-\overline{Z_7}\right)\end{array} $$
LY7 is our proposed NEW-STROKE model, obtained by the synthesis analysis method. The steps above constitute our synthesis analysis procedure.
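The sequential procedure can be summarized in a short sketch: each added risk factor is regressed against the current baseline prediction, and the adjusted, mean-centered term is then folded into the baseline before the next factor is considered. The function name `synthesis_model`, the use of a weighted mean for centering, and the demo values are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def synthesis_model(ly_hat, Z, gamma, weights=None):
    """Sequentially add the columns of Z to a baseline linear predictor.

    ly_hat : baseline prediction LY_hat, one value per patient
    Z      : array of shape (n_patients, n_added_factors)
    gamma  : literature-derived association of each added factor with stroke
    weights: per-patient weights (assumed; equal weights if None)
    Returns the final predictor, the coefficients (gamma_j - varsigma_j),
    and the means used for centering.
    """
    ly = np.asarray(ly_hat, dtype=float).copy()
    Z = np.asarray(Z, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    n, k = Z.shape
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)
    coefs = np.empty(k)
    means = np.empty(k)
    for j in range(k):
        z = Z[:, j]
        z_bar = np.average(z, weights=w)
        ly_bar = np.average(ly, weights=w)
        # Weighted least squares slope of the current baseline on Z_j.
        varsigma = (np.sum(w * (z - z_bar) * (ly - ly_bar))
                    / np.sum(w * (z - z_bar) ** 2))
        coefs[j] = gamma[j] - varsigma   # association not yet captured
        means[j] = z_bar
        ly = ly + coefs[j] * (z - z_bar)  # becomes the new baseline
    return ly, coefs, means

# Hypothetical example with a single added factor.
Z_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
ly_base_demo = 1.0 + 2.0 * Z_demo[:, 0]
ly_final_demo, coefs_demo, means_demo = synthesis_model(
    ly_base_demo, Z_demo, gamma=[2.5])
```

Because each added term is centered at its mean, the average of the predictor over the dataset is unchanged by every update, which is why the constant in the baseline equation does not need to be re-estimated.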
Therefore, the predicted probability that a patient will develop a stroke is \( p=\frac{1}{1+\exp\left(-LY7\right)} \). We can use this formula to predict the probability of having a stroke for a future patient.
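For completeness, the conversion from the final score to a probability can be sketched as below. The function name `stroke_probability` is hypothetical, and the two-branch form is only a numerical safeguard against overflow for large negative scores; it is algebraically the same as the formula above.

```python
import math

def stroke_probability(ly7):
    """Map the logit score LY7 to a probability p = 1 / (1 + exp(-LY7))."""
    if ly7 >= 0:
        return 1.0 / (1.0 + math.exp(-ly7))
    # For negative scores, use the equivalent form exp(x) / (1 + exp(x))
    # so that exp() is never called on a large positive number.
    e = math.exp(ly7)
    return e / (1.0 + e)
```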