Background
Chronic fatigue syndrome (CFS) affects at least 3% of the population, with women being at higher risk than men [1]. CFS is characterized by at least 6 months of persistent fatigue resulting in a substantial reduction in the person's level of activity [2-4]. Furthermore, in CFS, four or more of the following symptoms are present for 6 months or more: unusual post-exertional fatigue, impaired memory or concentration, unrefreshing sleep, headaches, muscle pain, joint pain, sore throat and tender cervical nodes [2-4]. It has been suggested that CFS is a heterogeneous disorder with a complex and multifactorial aetiology [3]. Among the hypotheses on the aetiology of CFS, one possible cause is genetic predisposition [5].
Single nucleotide polymorphisms (SNPs) can be used in clinical association studies to determine the contribution of genes to disease susceptibility or drug efficacy [6,7]. It has been reported that subjects with CFS can be distinguished by SNP markers in candidate genes involved in hypothalamic-pituitary-adrenal (HPA) axis function and neurotransmitter systems, including the catechol-O-methyltransferase (COMT), 5-hydroxytryptamine receptor 2A (HTR2A), monoamine oxidase A (MAOA), monoamine oxidase B (MAOB), nuclear receptor subfamily 3, group C, member 1 glucocorticoid receptor (NR3C1), proopiomelanocortin (POMC) and tryptophan hydroxylase 2 (TPH2) genes [8-11]. In addition, it has been shown that SNP markers in these candidate genes can predict whether a person has CFS using an enumerative search method and the support vector machine (SVM) algorithm [9]. Moreover, gene-gene and gene-environment interactions among these candidate genes have been assessed using the odds-ratio-based multifactor dimensionality reduction method [12] and the stochastic search variable selection method [13].
In genomic studies, the problem of identifying significant genes remains a challenge for researchers [14]. Exhaustive computation over the model space is infeasible when the space is very large: with p SNPs there are 2^p possible models [15]. The key goal of feature selection techniques is to find the genes and SNPs responsible for a given disease or drug response. It is vital to select the small number of SNPs that are significantly more influential than the others and to ignore the SNPs of lesser significance, thereby allowing researchers to focus on the most promising candidate genes and SNPs for diagnostics and therapeutics [16,17].
Previous studies [8,9] mainly modeled disease susceptibility in CFS using machine learning approaches without feature selection. In this work, we extended that research to uncover relationships between CFS and SNPs, comparing a variety of machine learning techniques including naive Bayes, the SVM algorithm, and the C4.5 decision tree algorithm. Furthermore, we employed feature selection methods to identify a subset of SNPs with predictive power for distinguishing CFS patients from controls.
Results
Tables 3, 4 and 5 summarize the results of repeated 10-fold cross-validation experiments with naive Bayes, SVM (with four kernels: linear, polynomial, sigmoid, and Gaussian radial basis function), and the C4.5 decision tree, using SNPs with and without feature selection. First, we calculated AUC, sensitivity, and specificity for these six predictive models without using the two proposed feature selection approaches. As indicated in Table 3, the average AUC values for the SVM prediction models with linear, polynomial, sigmoid, and Gaussian radial basis function kernels were 0.55, 0.59, 0.61, and 0.62, respectively. Of all the kernel functions, the Gaussian radial basis function kernel performed better than the other three in terms of AUC. Among all six predictive models, the SVM model with the Gaussian radial basis function kernel performed best, outperforming the naive Bayes (AUC = 0.60) and C4.5 decision tree (AUC = 0.50) models in terms of AUC. Moreover, as shown in Table 3, the original C4.5 algorithm without feature selection used 11 of the 42 SNPs, because the search for a feature subset with maximal performance is built into the C4.5 algorithm.
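The comparison above can be sketched with scikit-learn as a stand-in for the WEKA implementations actually used in the study: GaussianNB approximates WEKA's naive Bayes, and CART (DecisionTreeClassifier) approximates C4.5/J48. The genotype data below are synthetic placeholders (0/1/2 coded), not the study data.

```python
# Minimal sketch of one repetition of the 10-fold cross-validation
# comparison of six classifiers by AUC (synthetic data, illustrative only).
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(110, 42)).astype(float)  # 42 candidate SNPs
y = rng.integers(0, 2, size=110)                      # CFS case vs control

models = {
    "naive Bayes": GaussianNB(),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (polynomial)": SVC(kernel="poly"),
    "SVM (sigmoid)": SVC(kernel="sigmoid"),
    "SVM (RBF)": SVC(kernel="rbf"),
    "C4.5-style decision tree": DecisionTreeClassifier(random_state=0),
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in models.items():
    aucs = cross_val_score(clf, X, y, cv=cv, scoring="roc_auc")
    print(f"{name}: AUC = {aucs.mean():.2f} +/- {aucs.std():.2f}")
```

Repeating the loop over several random seeds and averaging would reproduce the "repeated" part of the protocol.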
Table 3
The result of a repeated 10-fold cross-validation experiment using naive Bayes, support vector machine (SVM), and C4.5 decision tree without feature selection.
Model | AUC | Sensitivity | Specificity | Number of SNPs |
Naive Bayes | 0.60 ± 0.17 | 0.64 ± 0.20 | 0.52 ± 0.21 | 42 |
SVM with linear kernel | 0.55 ± 0.14 | 0.55 ± 0.21 | 0.56 ± 0.21 | 42 |
SVM with polynomial kernel | 0.59 ± 0.13 | 0.46 ± 0.24 | 0.71 ± 0.21 | 42 |
SVM with sigmoid kernel | 0.61 ± 0.13 | 0.62 ± 0.20 | 0.61 ± 0.19 | 42 |
SVM with Gaussian radial basis function kernel | 0.62 ± 0.13 | 0.60 ± 0.20 | 0.64 ± 0.19 | 42 |
C4.5 decision tree | 0.50 ± 0.16 | 0.52 ± 0.21 | 0.48 ± 0.21 | 11 |
Table 4
The result of a repeated 10-fold cross-validation experiment using naive Bayes, support vector machine (SVM), and C4.5 decision tree with the hybrid feature selection approach that combines the chi-squared and information-gain methods.
Model | AUC | Sensitivity | Specificity | Number of SNPs |
Naive Bayes | 0.70 ± 0.16 | 0.65 ± 0.21 | 0.60 ± 0.20 | 12 |
SVM with linear kernel | 0.67 ± 0.13 | 0.62 ± 0.20 | 0.73 ± 0.19 | 14 |
SVM with polynomial kernel | 0.62 ± 0.13 | 0.56 ± 0.21 | 0.68 ± 0.18 | 9 |
SVM with sigmoid kernel | 0.64 ± 0.13 | 0.62 ± 0.20 | 0.67 ± 0.19 | 4 |
SVM with Gaussian radial basis function kernel | 0.64 ± 0.13 | 0.58 ± 0.20 | 0.71 ± 0.18 | 3 |
C4.5 decision tree | 0.64 ± 0.13 | 0.80 ± 0.16 | 0.46 ± 0.20 | 2 |
Table 5
The result of a repeated 10-fold cross-validation experiment using naive Bayes, support vector machine (SVM), and C4.5 decision tree with the wrapper-based feature selection method.
Model | AUC | Sensitivity | Specificity | Number of SNPs |
Naive Bayes | 0.70 ± 0.16 | 0.64 ± 0.20 | 0.63 ± 0.19 | 8 |
SVM with linear kernel | 0.63 ± 0.14 | 0.71 ± 0.20 | 0.55 ± 0.21 | 9 |
SVM with polynomial kernel | 0.63 ± 0.12 | 0.43 ± 0.20 | 0.82 ± 0.16 | 12 |
SVM with sigmoid kernel | 0.64 ± 0.13 | 0.59 ± 0.21 | 0.70 ± 0.18 | 6 |
SVM with Gaussian radial basis function kernel | 0.63 ± 0.13 | 0.60 ± 0.20 | 0.66 ± 0.19 | 7 |
C4.5 decision tree | 0.59 ± 0.16 | 0.65 ± 0.21 | 0.55 ± 0.22 | 6 |
Next, we applied the naive Bayes, SVM, and C4.5 decision tree classifiers with the hybrid feature selection approach that combines the chi-squared and information-gain methods. Table 4 shows the results of a repeated 10-fold cross-validation experiment for the six predictive algorithms with the hybrid approach. As presented in Table 4, the average AUC values for the SVM prediction models with linear, polynomial, sigmoid, and Gaussian radial basis function kernels were 0.67, 0.62, 0.64, and 0.64, respectively. Of all the kernel functions, the linear kernel performed better than the other three in terms of AUC. In addition, with the hybrid approach, the numbers of top-ranked SNPs selected for the SVM models with linear, polynomial, sigmoid, and Gaussian radial basis function kernels were 14, 9, 4, and 3 of the 42 SNPs, respectively. Among all six predictive models with the hybrid approach, naive Bayes (AUC = 0.70) was superior to the SVM and C4.5 decision tree (AUC = 0.64) models in terms of AUC. Moreover, the naive Bayes and C4.5 decision tree algorithms with the hybrid approach selected 12 and 2 of the 42 SNPs, respectively.
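A hybrid filter in this spirit can be sketched by ranking SNPs under both criteria and keeping those that score well under each. The combination rule used below (intersection of the two top-k lists) is an assumption for illustration, not necessarily the paper's exact rule, and the data are synthetic.

```python
# Hybrid filter sketch: chi-squared and information-gain (mutual
# information) rankings combined by intersecting the top-k lists.
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(110, 42))  # 42 SNP genotypes, coded 0/1/2
y = rng.integers(0, 2, size=110)        # CFS case vs control

chi2_scores, _ = chi2(X, y)                       # chi-squared statistic per SNP
ig_scores = mutual_info_classif(X, y, discrete_features=True,
                                random_state=0)   # information gain per SNP

k = 12
top_chi2 = set(np.argsort(chi2_scores)[::-1][:k])
top_ig = set(np.argsort(ig_scores)[::-1][:k])
selected = sorted(top_chi2 & top_ig)  # SNPs ranked highly by both filters
print(f"{len(selected)} SNPs selected:", selected)
```

In practice k (the number of top-ranked SNPs retained) would be tuned per classifier, which is how different models can end up with different subset sizes.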
Finally, we employed naive Bayes, SVM, and C4.5 decision tree with the wrapper-based feature selection approach. Table 5 presents the results of a repeated 10-fold cross-validation experiment for the six predictive algorithms with the wrapper-based approach. As shown in Table 5, the average AUC values for the SVM prediction models with linear, polynomial, sigmoid, and Gaussian radial basis function kernels were 0.63, 0.63, 0.64, and 0.63, respectively. Of all the kernel functions, the sigmoid kernel performed best, outperforming the other three in terms of AUC. Among all six predictive models with the wrapper-based approach, the SVM and C4.5 decision tree (AUC = 0.59) models were outperformed by the naive Bayes model (AUC = 0.70) in terms of AUC. In addition, the numbers of SNPs selected by the six models with the wrapper-based approach ranged from 6 to 12 (Table 5). For the naive Bayes model with the wrapper-based approach, only 8 of the 42 SNPs were identified: rs4646312 (COMT), rs5993882 (COMT), rs2284217 (CRHR2), rs2918419 (NR3C1), rs1866388 (NR3C1), rs6188 (NR3C1), rs12473543 (POMC), and rs1386486 (TPH2).
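A wrapper of this kind can be sketched with a greedy forward search that scores each candidate SNP subset by the cross-validated AUC of the classifier itself (here naive Bayes). scikit-learn's SequentialFeatureSelector stands in for the wrapper used in the study; the data are synthetic and the subset size of 8 is taken from the result above for illustration.

```python
# Wrapper-based selection sketch: the classifier is used as a black-box
# evaluation function inside a greedy forward feature search.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(110, 42)).astype(float)  # 42 candidate SNPs
y = rng.integers(0, 2, size=110)                      # CFS case vs control

sfs = SequentialFeatureSelector(GaussianNB(), n_features_to_select=8,
                                direction="forward", scoring="roc_auc",
                                cv=10)
sfs.fit(X, y)
selected = np.flatnonzero(sfs.get_support())
print("selected SNP indices:", selected)
```

Because the classifier is refit for every candidate subset at every step, wrappers are far more expensive than filters but tailor the subset to the specific model.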
It is also instructive to compare the classifiers with and without feature selection. Feature selection using the hybrid and wrapper-based approaches clearly improved naive Bayes, SVM, and C4.5 decision tree. Overall, the naive Bayes classifier with the hybrid approach and the naive Bayes classifier with the wrapper-based approach achieved the highest prediction performance (AUC = 0.70) of all the models. Additionally, the naive Bayes classifier with the wrapper-based approach used fewer SNPs (n = 8) than the naive Bayes classifier with the hybrid approach (n = 12).
Discussion
We have compared three classification algorithms, naive Bayes, SVM, and C4.5 decision tree, in the presence and absence of feature selection techniques to address the problem of modeling CFS. Accounting for all models is not a trivial task, because even a relatively small set of candidate genes results in a very large number of possible models [15]. For example, we studied 42 candidate SNPs, which yield 2^42 possible models. The three classifiers were chosen for comparison because they cover a variety of techniques with different representational models: probabilistic models for naive Bayes, kernel-based models for SVM, and decision tree models for the C4.5 algorithm [32]. The proposed procedures can also be implemented in the publicly available software WEKA [19] and thus can be widely used in genomic studies.
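The scale of the exhaustive search can be made concrete with a line of arithmetic:

```python
# Each of the p SNPs is either included in or excluded from a model,
# giving 2**p candidate models to enumerate.
p = 42
n_models = 2 ** p
print(n_models)  # 4398046511104, i.e. about 4.4 trillion models
```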
In this study, we employed the hybrid and wrapper-based feature selection approaches to find a subset of SNPs that maximizes the performance of the prediction model; the two approaches differ in how they incorporate the feature subset search into the classification algorithm. Our results showed that the naive Bayes classifier with the wrapper-based approach was superior to the other algorithms we tested, achieving the greatest AUC with the smallest number of SNPs in distinguishing CFS patients from controls. In the wrapper-based approach, no internal knowledge of the classification algorithm is needed: the feature selection process finds an optimal feature subset by using the classifier itself as part of the evaluation function [29]. Moreover, in the C4.5 decision tree, the search for a good feature subset is built into the classifier algorithm [24]; this is termed an embedded feature selection technique [34]. All three approaches, the hybrid, wrapper-based, and embedded methods, have the advantage of capturing the interaction between the feature subset search and the classification model, although the hybrid and wrapper-based methods carry a risk of over-fitting [34]. Furthermore, SVM is often considered to perform feature selection as an inherent part of its algorithm [25]. However, in our study, we found that adding an extra layer of feature selection on top of both the SVM and C4.5 decision tree algorithms was advantageous for both the hybrid and wrapper-based methods. Additionally, in a pharmacogenomics study, the embedded capacity of the SVM algorithm with recursive feature elimination [34,35] was utilized to identify a subset of SNPs more influential than the others in predicting the responsiveness of chronic hepatitis C patients to interferon-ribavirin combination treatment [30].
In this work, we used the proposed feature selection approaches to assess CFS-susceptible individuals and found a panel of genetic markers, in the COMT, CRHR2, NR3C1, POMC, and TPH2 genes, that were more significant than the others in CFS. Smith and colleagues reported that subjects with CFS were distinguished by the MAOA, MAOB, NR3C1, POMC, and TPH2 genes using traditional allelic tests and haplotype analyses [8]. Moreover, Goertzel and colleagues showed that the COMT, NR3C1, and TPH2 genes were associated with CFS using SVM without feature selection [9]. A study by Lin and Huang also identified significant SNPs in the SLC6A4, CRHR1, TH, and NR3C1 genes using a Bayesian variable selection method [14]. In addition, a study by Chung and colleagues found a possible interaction between NR3C1 and SLC6A4 using the odds-ratio-based multifactor dimensionality reduction method [12]. Similarly, another study by Lin and Hsu indicated a potential epistatic interaction between the CRHR1 and NR3C1 genes using a two-stage Bayesian variable selection methodology [13]. These studies all utilized the same dataset from the CDC Chronic Fatigue Syndrome Research Group. An interesting finding is that the association of NR3C1 with CFS, relative to non-fatigued controls, appeared consistent across several studies. This consistency strongly suggests that NR3C1 may be involved in the biological mechanisms of CFS. The NR3C1 gene encodes the glucocorticoid receptor, which is expressed in almost every cell in the body and regulates genes controlling a wide variety of functions, including development, energy metabolism, and the immune response [36]. A previous animal study observed that age increases the expression of the glucocorticoid receptor in neural cells [37], and increases in glucocorticoid receptor expression in human skeletal muscle cells have been suggested to contribute to the aetiology of the metabolic syndrome [38]. However, evidence of associations with CFS for the other genes was inconsistent across these studies. A potential reason for the discrepancies between our results and those of other studies may be sample size: studies conducted on small populations may be biased toward particular results. Future research with independent replication in large samples is needed to confirm the role of the candidate genes identified in this study.
This study had several limitations. First, the small sample size does not allow definite conclusions to be drawn. Second, we imputed missing values before comparing algorithms, and thus depended on unknown characteristics of the missing data, which could be either missing completely at random or the result of some experimental bias [25]. In future work, large prospective clinical trials are necessary to determine whether these candidate genes are reproducibly associated with CFS.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
LCH and SYH participated in the design of the study and coordination. EL performed the statistical analysis and helped to draft the manuscript. All authors read and approved the final manuscript.