Background
Gulf War illness (GWI) is an often-disabling condition with diverse symptoms such as chronic fatigue, cognitive dysfunction, pain, diarrhea and balance disturbance. It began as an explosive epidemic affecting tens of thousands of deployed U.S. and Coalition military personnel during and immediately after the 6-week Conflict period of the 1991 Persian Gulf War [1]. Initial epidemiologic investigations listed symptoms and potentially toxic environmental exposures [2] but, finding no objective signs or clinical tests to define the condition, were unable to link exposures with the disease [3]. In 1994 Haley et al. used a 2-stage principal components analysis of 52 symptom scales in a study of 249 deployed members of a U.S. Naval Reserve construction battalion to derive the first case definition of GWI including 3 primary variants [4]. The Research case definition was found to be strongly associated with measures of several environmental exposures including low-level organophosphate nerve agent [5]. A series of follow-up clinical case–control studies found associations of the case definition with objective neurophysiologic, autonomic and brain imaging abnormalities [6–9] as well as with a possible genetic marker, the PON1 Q192R polymorphism, where having the R allele increases susceptibility to nerve agent neurotoxicity [10].
In 1998 Fukuda et al. from the U.S. Centers for Disease Control and Prevention (CDC) described a simpler case definition more amenable to use in large field studies [11]. Under this definition, later known as the “CDC definition,” a positive result required endorsement of at least 2 of 10 typical GWI symptoms, and a “CDC Severe” subgroup was indicated if the positive symptoms were self-rated as “severe.” Similarly, in 2000 Steele employed a simple case definition resembling the CDC definition, which after several changes became known as the “Modified Kansas definition”; it required endorsement of at least 3 of 32 typical symptoms and excluded veterans with any of 10 comorbid conditions [12].
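The counting rules just described can be sketched in a few lines of Python. This is an illustration only, not the scoring code used in any of the cited studies: the severity coding, the `Respondent` container and the function names are assumptions made for the example, and the CDC Severe rule is read here as requiring the qualifying symptoms themselves to be rated severe.

```python
from dataclasses import dataclass, field

# Assumed severity coding for this sketch: 0 = absent, 1 = mild/moderate, 2 = severe.
ABSENT, MILD, SEVERE = 0, 1, 2


@dataclass
class Respondent:
    """Hypothetical container for one veteran's survey answers."""
    cdc_symptoms: dict       # symptom name -> severity code (up to 10 items)
    kansas_symptoms: dict    # symptom name -> severity code (up to 32 items)
    comorbid_conditions: set = field(default_factory=set)  # exclusionary diagnoses


def count_at_least(symptoms, level):
    """Number of symptoms rated at or above the given severity level."""
    return sum(1 for severity in symptoms.values() if severity >= level)


def cdc_case(r, min_symptoms=2):
    """CDC-style rule: at least 2 of the listed symptoms endorsed."""
    return count_at_least(r.cdc_symptoms, MILD) >= min_symptoms


def cdc_severe_case(r, min_symptoms=2):
    """CDC Severe subgroup (assumed reading: qualifying symptoms rated severe)."""
    return count_at_least(r.cdc_symptoms, SEVERE) >= min_symptoms


def kansas_case(r, min_symptoms=3, apply_exclusions=True):
    """Modified Kansas-style rule: at least 3 symptoms, optionally excluding comorbidities."""
    if apply_exclusions and r.comorbid_conditions:
        return False
    return count_at_least(r.kansas_symptoms, MILD) >= min_symptoms


veteran = Respondent(
    cdc_symptoms={"fatigue": SEVERE, "joint pain": MILD},
    kansas_symptoms={"fatigue": SEVERE, "joint pain": MILD, "headache": MILD},
    comorbid_conditions={"diabetes"},
)
print(cdc_case(veteran), cdc_severe_case(veteran), kansas_case(veteran))
```

In this example a veteran with two mild-to-severe symptoms qualifies under the CDC-style rule but not under the severe subgroup, and the comorbidity exclusion alone removes the veteran from the Modified Kansas-style case group.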
Two later studies applied structural equation modeling to validate the original Research case definition [13, 14], but the CDC and Modified Kansas definitions were never validated. Subsequently additional investigators developed their own case definitions but the original Research, CDC and Modified Kansas definitions became predominant in GWI research.
In 2013 the U.S. Department of Veterans Affairs commissioned a literature review by an ad hoc committee of the Institute of Medicine to propose a standardized case definition [15]. Finding no objective criteria on which to compare the existing case definitions, the committee recommended use of the CDC or Kansas definitions because they were judged to best cover the symptoms most commonly reported by ill Gulf War veterans. Recently, however, the U.S. Military Health Survey (USMHS) reported from a large nationally representative sample of Gulf War veterans a strong association of the original Research case definition with a gene-environment (GxE) interaction of the PON1 Q192R polymorphism and veterans’ reports of having heard nerve agent alarms in the war. Because it demonstrated a mechanistic interaction that could not be explained away by errors in measurement, the GxE interaction provided strong evidence of a causal role of low-level sarin in GWI [16, 17].
Since the USMHS collected all 3 case definitions, we reanalyzed the data to compare the GWI symptom profiles of the 3 case definitions and their power to detect the PON1 Q192R GxE interaction. The findings are relevant to choosing the best uses for each case definition.
Discussion
The central finding of our study is that, of the 3 commonly used GWI case definitions, the original Research definition had twice the statistical power of the CDC and Modified Kansas definitions for detecting the associations of GWI with having heard nerve agent alarms, the enrichment of the PON1 Q192R polymorphism, and their GxE interaction. This is important because this genetic finding represents the first compelling evidence for an etiology of GWI, and without the Research definition the association would probably not have been discovered. The reason for this difference in statistical power appears related to differences in the stringency of defining a case. Most veterans meeting the Research case definition endorsed larger numbers of symptoms of greater severity, associated with substantial impairment in health-related quality of life as measured by the SF-12 scores. In contrast, those meeting the CDC and Modified Kansas case definitions, though encompassing those meeting the Research definition, were mostly veterans with smaller numbers of milder symptoms and higher health-related quality of life that differed little, on average, from that of the control group of subjects not meeting any case definition. Temporarily removing the participants who also met the Research definition from the CDC-positive and Modified Kansas-positive case groups uniformly reduced the statistical power of those 2 case definitions, confirming that the large remaining subset of veterans contained more misclassified subjects.
This difference in stringency of defining a case is explained by the construction of the case definitions (Table S1). The CDC and Modified Kansas case definitions are satisfied by a veteran’s having as few as 2 or 3 individual symptoms from categories of symptoms commonly found in many conditions in civilian life. Although development of the Research definition started with virtually the same list of symptoms (Table S2), it used two-stage principal components factor analysis first to parse each of the ambiguous raw symptom questions into unambiguous symptom scales, and then it used a factor-weighted sum of all the symptom scales to identify reproducible symptom complexes, so that an individual veteran had to share a complex of symptoms with other ill veterans and exceed a high threshold on the syndrome scales of these complexes to be classified as a case. The resulting Research case definition set a much higher threshold, which was met by only approximately one-third of those who met the CDC and Modified Kansas without exclusions definitions. Consequently, CDC and Modified Kansas definitions are highly inclusive (high sensitivity) but include many non-cases in the case group (low specificity); in contrast, the Research definition selects high probability cases (high specificity) but misses many true cases (low sensitivity). The Research definition detected manifestations of environmental chemical exposures with greater specificity; whereas, by requiring only a few individual symptoms, which occur commonly in other non-war-related conditions, the CDC and Modified Kansas definitions provided greater sensitivity for a wide range of conditions from severe and mild chemical exposures to diverse chronic illnesses and injuries possibly unrelated to deployment but at the expense of lower specificity for GWI.
To directly test this explanation, we developed a method of estimating the diagnostic sensitivity and specificity of each GWI case definition using detection of the GxE interaction in place of a “gold standard” diagnostic test, which does not yet exist. From a review of the extensive literature on disease misclassification in epidemiology [28], we adapted to our study design the mathematical model of Brenner and Savitz for correcting the odds ratio for disease misclassification in case–control studies [29]. Their model assessed the separate and combined effects of sensitivity and specificity to determine which should be maximized in choosing a case definition for a case–control study in which the relative sample sizes of both the case and control groups could vary. In our study design, however, the control group had already been selected to contain no subjects meeting any of the 5 case definitions being compared and thus was static, which simplified our problem. Moreover, the specifications of the case definitions as well as the Venn diagram of their overlaps (Fig. 1) justified the simplifying assumptions that the CDC and Modified Kansas without exclusions had perfect sensitivity while the Research case definition had perfect specificity. With these assumptions our adaptation of the Brenner-Savitz correction equations could make corrections for disease misclassification on the RERI as a function of the specificity of the case definition used. This corrected RERI could then be compared with the biased RERI calculated directly from the study data so that the specificity at which the 2 RERI estimates agree would identify the intrinsic specificity of the case definition.
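The logic of matching a corrected and a biased RERI can be illustrated with a simple mixture model. The sketch below is not the adapted Brenner-Savitz equations themselves, which are not reproduced in this section. It assumes perfect sensitivity, nondifferential misclassification (false positives share the controls' exposure distribution), a positive true RERI, and a hypothetical true prevalence of GWI. It uses the standard formula RERI = OR11 − OR10 − OR01 + 1, and all numbers and function names are invented for illustration.

```python
STRATA = [(1, 0), (0, 1), (1, 1)]  # (genotype, exposure); (0, 0) is the reference


def odds_ratios(case_dist, control_dist):
    """Odds ratios of each exposed stratum versus the doubly unexposed stratum."""
    ref = case_dist[(0, 0)] / control_dist[(0, 0)]
    return {k: (case_dist[k] / control_dist[k]) / ref for k in STRATA}


def reri(ors):
    """Relative excess risk due to interaction on the additive scale."""
    return ors[(1, 1)] - ors[(1, 0)] - ors[(0, 1)] + 1.0


def predicted_reri(true_case_dist, control_dist, specificity, prevalence):
    """RERI expected when the case group is diluted with false positives.

    Assumes perfect sensitivity, so the case group holds all true cases plus
    a fraction (1 - specificity) of non-cases.
    """
    true_pos = prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    observed = {k: ppv * true_case_dist[k] + (1.0 - ppv) * control_dist[k]
                for k in true_case_dist}
    return reri(odds_ratios(observed, control_dist))


def solve_specificity(target_reri, true_case_dist, control_dist, prevalence):
    """Bisection for the specificity at which predicted and observed RERI agree.

    Valid when the true RERI is positive, because the predicted RERI then
    rises monotonically with specificity.
    """
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if predicted_reri(true_case_dist, control_dist, mid, prevalence) < target_reri:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Invented joint distributions of genotype and exposure (proportions sum to 1).
true_cases = {(0, 0): 0.30, (1, 0): 0.20, (0, 1): 0.20, (1, 1): 0.30}
controls = {(0, 0): 0.50, (1, 0): 0.25, (0, 1): 0.15, (1, 1): 0.10}

biased = predicted_reri(true_cases, controls, specificity=0.82, prevalence=0.30)
recovered = solve_specificity(biased, true_cases, controls, prevalence=0.30)
print(round(recovered, 4))
```

Under these assumptions the biased RERI is always smaller than the true RERI, and the bisection recovers the specificity that produced it. In practice the "true" case distribution would have to come from a definition assumed to have perfect specificity.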
In support of our hypothesis, under the assumption of perfect sensitivity, the CDC and Modified Kansas without exclusions definitions were found to have reduced specificities of 0.82 (0.78–0.86) and 0.84 (0.80–0.88), respectively. Excluding subjects with comorbid diseases, however, reduced both specificity [0.79 (0.74–0.82)] and sensitivity [0.59 (0.55–0.63)] of the Modified Kansas definition with exclusions. Under the assumption of perfect specificity the Research and CDC Severe definitions had sensitivities of 0.40 (0.36–0.43) and 0.31 (0.28–0.35), respectively (Fig. 1B and Table 3). As Brenner and Savitz established [29], we found that with perfect specificity, even though the reduced sensitivity caused the Research and CDC Severe definitions to miss 60% and 69% of true GWI cases, respectively, this did not affect their power to detect the RERI of the GxE interaction; whereas, the reduced specificity of the CDC and Modified Kansas definitions caused severe losses of power (Tables 2 and 3).
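The contrast described above between lost sensitivity and lost specificity can be illustrated numerically. The following sketch assumes the study design described earlier, in which classified cases are compared with a clean control group of true non-cases, and it assumes nondifferential misclassification. The exposure prevalences and the true GWI prevalence are invented for illustration; only the sensitivity and specificity values are taken from the estimates reported above.

```python
def odds(p):
    """Convert a probability to odds."""
    return p / (1.0 - p)


def expected_or(p_case, p_noncase, se, sp, prevalence):
    """Expected exposure odds ratio for classified cases versus clean controls.

    p_case / p_noncase: exposure prevalence among true cases / true non-cases.
    se, sp: sensitivity and specificity of the case definition.
    prevalence: true disease prevalence in the source population (hypothetical).
    """
    true_pos = se * prevalence                   # true cases captured
    false_pos = (1.0 - sp) * (1.0 - prevalence)  # non-cases labeled as cases
    p_observed = (true_pos * p_case + false_pos * p_noncase) / (true_pos + false_pos)
    return odds(p_observed) / odds(p_noncase)


# Hypothetical inputs: half of true cases exposed, one-fifth of non-cases exposed.
P_CASE, P_NONCASE, PREVALENCE = 0.50, 0.20, 0.30

perfect = expected_or(P_CASE, P_NONCASE, se=1.00, sp=1.00, prevalence=PREVALENCE)
low_se = expected_or(P_CASE, P_NONCASE, se=0.40, sp=1.00, prevalence=PREVALENCE)
low_sp = expected_or(P_CASE, P_NONCASE, se=1.00, sp=0.82, prevalence=PREVALENCE)

print(f"missed true cases at Se = 0.40: {1 - 0.40:.0%}")
print(f"OR with no misclassification:   {perfect:.2f}")
print(f"OR with Se = 0.40, Sp = 1.00:   {low_se:.2f}")
print(f"OR with Se = 1.00, Sp = 0.82:   {low_sp:.2f}")
```

With perfect specificity the case group contains only true cases, so the expected odds ratio is the same whether 40% or 100% of them are captured; only the sample size shrinks. With reduced specificity, false positives carrying the controls' exposure pattern dilute the case group and pull the odds ratio toward 1.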
These findings reaffirmed the conclusion of Brenner and Savitz that for research studies case definitions that maximize specificity at the expense of sensitivity, such as Research and CDC Severe, are superior to those that maximize sensitivity over specificity, such as the CDC and Modified Kansas definitions [29]. Consequently, to maximize specificity it is crucial either to employ a series of diagnostic tests, all of which must be positive to qualify as a case, or to screen all prospective cases carefully to remove false positives. In contrast, case definitions with looser criteria tend to perform better for clinical practice, where it is important to maximize the number of ill patients included in treatment and research hypotheses are not being tested [29].
Our finding of reduced specificity and sensitivity of the Modified Kansas definition with exclusions supports the growing practice of reducing or eliminating the exclusion of comorbidities from the Modified Kansas case definition [30]. Phasing out the exclusions has been prompted by the realization that as veterans age, they acquire more of the age-related comorbidities, either incidentally or as GWI necessitates a sedentary lifestyle [31]. In the original population-based study in Gulf War veterans from Kansas, 34% of Gulf War veterans met the Modified Kansas case definition with exclusions [12], but in our nationwide population-based survey performed 10 years later, only 25.6% now met the criteria after comorbidity exclusions were made. Moreover, our analysis found that the exclusions disproportionately eliminated more severely ill veterans but did not improve specificity or statistical power.
An unexpected finding was that the CDC Severe subgroup [11] had almost as much statistical power as the Research case definition. This was due to its primarily selecting the same subset of ill veterans as the Research definition (Fig. 1A). Ironically, in our literature review we found only two instances where the CDC Severe subclassification was used in a study of GWI [32, 33], although its relative ease of collection suggests it could be used more widely in the future. Its use, however, is also limited by the small percentage of GWI cases it selects.
A potential limitation of the study is that, whereas the Research and CDC case definitions were originally designed and applied as self-administered written questionnaires (Kansas was originally administered by telephone), in the present study the information for all 3 case definitions was acquired in the telephone interviews by trained professional interviewers following a computerized script. In adapting the original questionnaires to an interview script, we put the information into a conversational format and omitted 4 of the 32 symptom questions we found duplicative from the Modified Kansas question set as part of a reduction in interview length. While any changes are likely to alter the information obtained, the fact that we embedded the identical wording in the script and the omitted questions were duplicative suggests that the interviews collected largely the same information.
Moreover, over the years the CDC and Modified Kansas question sets have been adapted and applied variously in many contexts [34, 35], and the list of exclusionary conditions has been altered in diverse ways [30], both affecting the information obtained. Consequently, although our interview survey may have introduced some differences from the original applications of these case definitions, we believe that our study well captures the differences in misclassification and power of the alternative approaches to case definition development and use.
These findings have important implications for the selection and use of these case definitions in future GWI research. While all 3 detected the associations with the risk factors, the approximately 50 percent loss of statistical power by the CDC and Modified Kansas case definitions reflects that a high proportion of their cases are falsely positive misclassifications. When misclassified subjects comprise a substantial proportion of total cases, final conclusions can be severely biased [36]. In clinical case–control studies testing for pathophysiologic or diagnostic biomarkers, common in this field, if the misclassification in the GWI diagnosis is nondifferential (i.e., unassociated with the risk factors), then the bias only reduces the power to reject the null hypothesis. In this case avoiding a type II error requires estimating the loss of power in the design phase and increasing the sample size to compensate. If, however, the bias is differential, so that only the cases spuriously diagnosed with GWI are associated with a risk factor, the investigators might falsely conclude that the risk factor is a cause of GWI. Similarly, in a randomized clinical trial of treatment in veterans meeting the CDC or Modified Kansas case definitions of GWI, a current priority of funding agencies, if many patients with mild depression, not severe enough to require hospitalization, are spuriously classified as GWI because they have, say, chronic fatigue, difficulty concentrating and functional pain (common symptoms of depression that might meet both CDC and Modified Kansas case definitions), then a treatment that improves depression but not GWI might be falsely labeled an effective treatment for GWI [36].
To avoid such costly errors, epidemiologic and clinical case–control studies and clinical trials using the CDC or Modified Kansas case definitions should add tests to screen out false positives [29], as in a recent study detecting mitochondrial dysfunction in GWI [37]. Alternatively, they should embed sub-studies to estimate the rate of misclassification and then correct for it, a practice that has been extensively recommended but rarely applied [38, 39]. As a further option, use of a more restrictive case definition such as the original Research case definition or the CDC Severe subclassification might be preferable [36]. Since GWI prevalences are lower with these, they may incur greater costs in recruitment, but this might be preferable to falsely negative results or spurious conclusions from highly misclassifying case definitions.
Finally, some may struggle to understand why so much is being made over misclassification in the case definition when in the normal practice of epidemiology this is rarely encountered. We believe this is because in most studies the case definition is based on relatively precise measures, such as pathogen identification, diagnostic laboratory tests, etc. This avoids substantial misclassification of non-cases as cases, thus automatically achieving high specificity (Sp) of the case definition. In the presence of high Sp, not capturing all the true cases, that is, low sensitivity (Se), has no adverse effect on the analysis and conclusions. This is why we routinely collect only a subset of the true cases and non-cases with minimal bias. Only when studying diseases diagnosed by highly imprecise case definitions prone to misclassification of non-cases as cases, such as GWI, does low Sp of the case definition become an issue. Even then, when the low Sp of an imprecise case definition is recognized, it is often intuitively resolved by applying additional classification steps such as a diagnostic interview to weed out misclassified cases [29].
Acknowledgements
A large research team of survey specialists at RTI International contributed importantly to the design and performed the field work for the U.S. Military Health Survey. Research leaders included Kathleen A. Considine, Vincent G. Iannacchione, Christopher P. Carson, Heather Best, Carla Bann, Darryl Creel, Barbara Alexander, Amanda Lewis-Evans, Lily Trofimovich, Kirk Pate, Anne Kenyon, Jeremy Morton, Craig Hill and Robert E. Mason. UT Southwestern team members who contributed substantially included Dr. Christine Garcia, Aimee Lamb, Rick Thompson, Eric Cordell, Jennifer Escobar and Dr. Wesley Marshall.
SF-36® and SF-12® are registered trademarks of the Medical Outcomes Trust.