Sample
The sample was already described in the original paper [32]. Briefly, two hundred and seventeen depressed inpatients were included in this study (age = 52.11 ± 12.04 years; age at onset = 37.97 ± 12.16 years; female/male: 144/73; bipolar: delusional/non-delusional = 40/33; major depressive: delusional/non-delusional = 71/73). All patients were evaluated at baseline and weekly thereafter until the sixth week using the 21-item Hamilton Rating Scale for Depression (HAM-D-21) [33], administered by trained senior psychiatrists blind to genetic data and to treatment (fluvoxamine 300 mg daily from day 8, plus pindolol 7.5 mg in one third of the sample). A decrease in the HAM-D score to 8 or less was considered the response criterion. After the procedure had been fully explained to all subjects, informed consent was obtained.
Plasma fluvoxamine levels were determined by high-performance liquid chromatography after 2 weeks of a stable 300 mg daily dose [34]. Nine patients with extreme plasma levels (more than 2 standard deviations) were removed from the study in order to avoid biases due to the side effects that occur at high doses. Subjects with plasma levels below 20 ng/ml were also to be excluded, as this may indicate non-compliance, but no cases with such low levels were observed. The influence of both SERTPR and TPH polymorphisms was limited to subjects not taking pindolol [32]; we therefore included in the present study the 121 subjects receiving fluvoxamine alone (81 responders/40 non-responders). DNA analysis was performed as described in the original paper [32].
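The two plasma-level exclusion rules can be illustrated with a short Python sketch. It is not the procedure actually used: it assumes that "more than 2 standard deviations" means distance from the sample mean in either direction, that the sample (n - 1) standard deviation is used, and the function name and data are hypothetical.

```python
from statistics import mean, stdev

def exclude_by_plasma_level(levels, n_sd=2.0, floor=20.0):
    """Return the indices of subjects kept after both exclusion rules.

    levels: plasma fluvoxamine levels in ng/ml (hypothetical data).
    n_sd:   cut-off in standard deviations from the sample mean (assumed two-sided).
    floor:  lower limit in ng/ml, below which non-compliance is suspected.
    """
    mu = mean(levels)
    sd = stdev(levels)  # sample standard deviation (assumption)
    kept = []
    for i, level in enumerate(levels):
        extreme = abs(level - mu) > n_sd * sd
        too_low = level < floor
        if not extreme and not too_low:
            kept.append(i)
    return kept

# Hypothetical levels: one very high value and one below the 20 ng/ml floor.
levels = [100, 110, 120, 130, 140, 150, 160, 170, 180, 1000, 10]
kept = exclude_by_plasma_level(levels)
```

In this toy example the subject at 1000 ng/ml is dropped by the standard-deviation rule and the subject at 10 ng/ml by the floor rule.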
Model development and selection
An "intent-to-treat" analysis was carried out for all patients who had a baseline assessment and at least one assessment after randomization, with the last observation carried forward on the HAM-D. For the current application, the inputs to the first layer of the neural network consist of the SERTPR and TPH genotypes, while the target outputs consist of response status. The network is then trained to predict response from genotypes. Each node of the input layer is set to a value representing the genotype at one polymorphism: for each polymorphism and each subject, this value encodes the genotype aa, ab or bb. If a marker genotype is missing, the input is assigned the average of the values for all subjects in the dataset; however, no data were missing in our sample. The target output for the network is set to 1 or 2 depending on whether or not the subject responded.
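The input and target coding described above can be sketched in Python as follows. The numeric values assigned to aa, ab and bb (0, 1, 2) are an assumption, since the text does not state them, as is the choice of 1 for responders and 2 for non-responders; the function names are hypothetical, and the average used for missing genotypes is taken over subjects with an observed genotype.

```python
# Assumed numeric coding of the three genotypes (not given in the text).
GENOTYPE_CODE = {"aa": 0.0, "ab": 1.0, "bb": 2.0}

def encode_marker(genotypes):
    """Encode one polymorphism across subjects; None marks a missing genotype.

    Missing genotypes receive the average code of the observed genotypes.
    """
    observed = [GENOTYPE_CODE[g] for g in genotypes if g is not None]
    fill = sum(observed) / len(observed)
    return [GENOTYPE_CODE[g] if g is not None else fill for g in genotypes]

def encode_target(is_responder):
    """Target output: 1 for responders, 2 for non-responders (assumed order)."""
    return [1 if r else 2 for r in is_responder]

# Hypothetical subjects: the third has a missing genotype at this marker.
inputs = encode_marker(["aa", "ab", None, "bb"])
targets = encode_target([True, False, True, False])
```

With two polymorphisms, each subject would contribute two such input values, one per input node.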
The best network was selected on the basis of its discriminating error and performance; positive and negative predictive values were also reported for each model. Performance was expressed as the area under the Receiver Operating Characteristic (ROC) curve. The area under a ROC curve ranges from zero to one, with values close to unity indicating better predictive power; an area of 0.5 indicates that the model predicts no better than a random choice.
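As an illustration of these performance measures, the following Python sketch computes the area under the ROC curve through its rank interpretation (the probability that a randomly chosen responder receives a higher score than a randomly chosen non-responder, with ties counted as one half) together with the positive and negative predictive values. It is not the routine of the software used in the study; the function names and data are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve from pairwise comparisons (ties count 0.5).

    labels: truthy for responders, falsy for non-responders.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def predictive_values(predicted, labels):
    """Return (PPV, NPV); a value is None when its denominator is zero."""
    pairs = list(zip(predicted, labels))
    tp = sum(1 for p, y in pairs if p and y)
    fp = sum(1 for p, y in pairs if p and not y)
    tn = sum(1 for p, y in pairs if not p and not y)
    fn = sum(1 for p, y in pairs if not p and y)
    ppv = tp / (tp + fp) if (tp + fp) else None
    npv = tn / (tn + fn) if (tn + fn) else None
    return ppv, npv

# Hypothetical network outputs and true response status.
auc_perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
auc_chance = roc_auc([0.5, 0.5], [1, 0])
ppv, npv = predictive_values([1, 1, 0, 0], [1, 0, 0, 1])
```

A model whose scores separate the two groups completely reaches an area of 1, while a model that assigns the same score to everyone stays at 0.5.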
However, one major problem of NN analyses is to establish whether the prediction from genotypes is better than would be expected by chance. If the whole sample is used for training, the network will to some extent "learn to recognise" particular features of each member of the dataset and can use these to predict response in a way that may not reflect any general association between marker genotypes and disease. This problem is generally addressed with one of a set of strategies: splitting the dataset (50:50, 80:20...), jackknife, bootstrapping, cross-validation and so on. However, those methods present some disadvantages; in particular, if only part of the data is used to train the network, this leads to a loss of power, given that subjects in the validation part have different patterns of association between genotypes and drug response.
In order to remedy these problems, in the case of MLP, it has been suggested to perform both training and testing on the entire dataset. The statistical significance of any observed association between outputs and affection status can be estimated using a permutation test [25].
Once the network has been defined, a statistic, denoted T, is calculated to compare the outputs for responders and non-responders in the same way as an unpaired t statistic, although the statistic is not expected to follow a t distribution under the null hypothesis. Instead, a permutation procedure is performed to estimate statistical significance. A large number of replicate datasets are generated from the original data and the obtained network model by randomly permuting genotypes with respect to affection status. Each replicate dataset is then trained and tested as before, and T is calculated each time. Since each permuted dataset will have only a random association between genotype and affection status, we obtain N values of T, which provide a distribution of T under the null hypothesis. We count the number of these values that exceed the value of T obtained for the real dataset and denote this number R. Then (R + 1)/(N + 1) provides an unbiased estimate of the statistical significance of the association between genotype and affection status in the real dataset.
In order to estimate a p-value of alpha, one should carry out approximately 10/alpha replicates. Typically, in order to detect association at a significance of 0.01 one would perform 1000 replicates (including the real dataset and 999 permuted datasets). In the case of the present paper we performed 10000 replicates.
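The permutation procedure can be sketched in Python as follows. This is a minimal illustration, not the NNPERM implementation: the network is replaced by a hypothetical stand-in (`fit_and_predict`, which outputs the responder rate observed for each genotype combination), T is assumed to take the pooled-variance form of the unpaired t statistic, and response status is shuffled relative to genotypes, which is equivalent to permuting genotypes with respect to status.

```python
import math
import random
from statistics import mean, variance

def t_statistic(outputs, status):
    """Unpaired t-like statistic (pooled variance assumed); status 1 = responder."""
    a = [o for o, s in zip(outputs, status) if s == 1]
    b = [o for o, s in zip(outputs, status) if s != 1]
    diff = mean(a) - mean(b)
    pooled = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / (len(a) + len(b) - 2)
    if pooled == 0.0:
        # No within-group spread: T is zero or unbounded.
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / math.sqrt(pooled * (1 / len(a) + 1 / len(b)))

def fit_and_predict(genotypes, status):
    """Hypothetical stand-in for training and testing a network on the full dataset."""
    groups = {}
    for g, s in zip(genotypes, status):
        groups.setdefault(g, []).append(1.0 if s == 1 else 0.0)
    rate = {g: mean(v) for g, v in groups.items()}
    return [rate[g] for g in genotypes]

def permutation_p_value(genotypes, status, n_perm=999, seed=0):
    """Estimate significance as (R + 1)/(N + 1) over N permuted datasets."""
    rng = random.Random(seed)
    t_obs = t_statistic(fit_and_predict(genotypes, status), status)
    shuffled = list(status)
    r = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any genotype/status association
        t_perm = t_statistic(fit_and_predict(genotypes, shuffled), shuffled)
        if t_perm > t_obs:  # "exceeds", as in the text
            r += 1
    return (r + 1) / (n_perm + 1)

# Hypothetical data: genotype combination perfectly separates the two groups.
genotypes = [("aa", "aa")] * 10 + [("bb", "bb")] * 10
status = [1] * 10 + [2] * 10
p = permutation_p_value(genotypes, status, n_perm=99)
```

Because the model is refitted on every permuted dataset, the null distribution of T reflects the same overfitting that affects the real dataset, which is what allows training and testing on the whole sample. Counting the real dataset as one replicate, 1000 replicates correspond to `n_perm=999`.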
Multiple regression and discriminant function analyses were performed to compare the results obtained with the NN strategy with those of traditional techniques. Responder status was the dependent variable, with SERTPR and TPH as independent variables. Genotypes were scored in the following way according to the hypothesis of codominance (SERTPR*l/l = 1, SERTPR*l/s = 2, SERTPR*s/s = 2, TPH*C/C = 1, TPH*C/A = 2, TPH*A/A = 2).
Calculations for the NN selection were performed using STATSOFT (Kernel release 5.5 A). Evaluation was performed using the NNPERM package [31].