Normative data are sometimes seen as a uniquely neuropsychological problem. However, all types of brain measurements (including biomarkers) are sensitive to non-disease effects, demographic effects in particular. This issue requires particular consideration in neuropsychology because individual neuropsychological measures (typically administered as part of a test battery) have unique and complex relationships with demographic variables; for example, some have non-linear relationships with brain functions, while others are contextual in nature (e.g., socio-historical effects of ethnicity). The picture is further complicated by the high degree of inter-relatedness amongst demographic variables (e.g., ethnicity or geographical location sometimes acts as a proxy for more direct effects of education and socioeconomic status [16]). This complexity, however, is first and foremost a reflection of the brain’s own complexity. “Good normative data,” that is, datasets based on a large sample size with well-identified demographic effects, broadly representative of a group of people (usually a nation) for each neuropsychological measure, are often seen as a luxury because they are resource intensive. In fact, this approach is less costly at a national level than many other scientific methods, because its benefits are cross-disciplinary.
Importantly, acquisition of neuropsychological data in a healthy control group does not in and of itself constitute “good” normative data. Accurate quantification of demographic and/or socio-cultural effects is critical to an optimal norming process.
Large samples are necessary to stabilize demographic and other effects relevant to normal performance. This is crucial both for ensuring representativeness (typically at a national level) and for stabilizing the inter-correlations within a test battery, so that the factor loadings reflecting “normal” functioning across cognitive domains can be approximated as closely as possible. This is not to say that small- to medium-sized samples (N = 50–100) of HIV− persons cannot be used as a normative reference in HIV research; however, there are important conditions and limitations to their use. Indeed, if the Frascati criteria are to be applied optimally, such samples should only be used in relation to a restricted and closely comparable HIV+ sample (preferably of similar size and, as a bare minimum, comparable in age and sex) [17]. To illustrate some of the issues associated with using a small HIV− control sample to assess the validity of the Frascati criteria, we specifically review a recent study by Meyer et al. [10], which analyzed the false-positive rates arising from different computations of the criteria in a Kenyan HIV− sample (N = 84) as well as in a simulated sample. The demographic characteristics of the Kenyan sample were not presented, including key variables known to dramatically influence the stability of normal neuropsychological performance in limited-resource settings [18, 19]. The study also did not report whether the tests were culturally adapted for the Kenyan sample, which makes it even harder to determine whether uncontrolled socio-demographic effects affected performance. Under such circumstances the construct validity of a neuropsychological battery can be substantially reduced, resulting in a given test measuring construct(s) other than the cognitive function(s) it is intended to measure. In this instance, some of the explanatory variance due to demographic factors is likely interfering with the test construct, meaning that, for the Frascati criteria to be applied correctly, the HIV− sample can only be used in comparison with a closely matched HIV+ sample, as the criteria assume that test constructs will be similar in both samples. As further support for their arguments, the criteria were also tested in a somewhat vaguely defined simulated normal sample. However, their computations assumed configurations and correlational structures amongst the test battery that generally fail to reflect the neuropsychological methods advocated in the Frascati criteria, and as such their conclusions serve only to reiterate existing psychometric knowledge gleaned from Classical Test Theory [20]. Even when the CNS HIV Anti-Retroviral Therapy Effects Research (CHARTER) study test battery was considered, at no time did the authors correctly compute the Frascati criteria, as they failed to take into account the neuropsychological measure/domain count specific to this test battery [21]. Their proposal to apply a cut-off of −1.5 SD per cognitive domain is in fact already in use when at least two neuropsychological measures are included in a cognitive domain, as detailed above. It would nonetheless be interesting to see the simulation work re-conducted with the Frascati criteria correctly applied, possibly using the Global Deficit Score (GDS) method as they suggest (see also the later section on updating the Frascati criteria).
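To make the false-positive issue concrete, the nominal rate of statistically defined “impairment” in a perfectly healthy sample can be sketched under simplifying assumptions. The sketch below is illustrative only and is not a reproduction of any published computation: it assumes a Frascati-style rule flagging performance at least 1 SD below the normative mean in at least two of five cognitive domains, with domain scores treated as independent standard-normal variables.

```python
import math
import random

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def criterion_fp_rate(n_domains=5, k_min=2, z_cut=-1.0):
    """Nominal false-positive rate of a rule that flags scores below
    z_cut in at least k_min of n_domains INDEPENDENT domains.
    (Illustrative parameter values; real batteries differ.)"""
    p = norm_cdf(z_cut)  # per-domain probability of falling below the cut
    return sum(math.comb(n_domains, k) * p**k * (1 - p)**(n_domains - k)
               for k in range(k_min, n_domains + 1))

# Analytic rate for the illustrative rule above (roughly 18%).
analytic = criterion_fp_rate()

# Monte Carlo check on simulated healthy subjects.
random.seed(0)
trials = 100_000
flagged = sum(
    sum(random.gauss(0.0, 1.0) < -1.0 for _ in range(5)) >= 2
    for _ in range(trials)
)
print(analytic, flagged / trials)
```

Note that the independence assumption is the weakest link: real cognitive domains are positively correlated, which reduces this rate, and this is precisely why the correlational structure assumed in any simulation of the criteria matters.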
Strictly speaking, normative data include demographic corrections that have been carefully identified in the norming process [16]. This can be achieved using one of two approaches. The first is to develop z-scores stratified by age and education ranges (and sometimes sex). This strategy is used by most test developers, typically in samples that rarely exceed N = 300 except for some major test batteries [16]. The application of such z-scores is restricted to clinical samples with closely comparable demographics, as explained above [16]. The second, more sophisticated method is to create demographically corrected T-scores (another type of standard score), which represent a predicted value corrected for demographic effects using linear and non-linear analyses [22]. This method has been used in large samples (N > 1000) [22]. Importantly, this method actually eliminates demographic effects, while demographically stratified z-scores do not. This means that demographically corrected T-scores provide the closest approximation to an individual’s personal circumstances and therefore yield the most accurate estimate of disease-related effects. Performance in large normative datasets is typically distributed according to the normal curve, especially when averaged across several neuropsychological measures (as in a cognitive domain). Performance at the lower tail of the distribution can be defined as impaired according to statistical criteria. In other words, in a non-clinical sample, this level of performance is not abnormal per se, but represents lower normal performance. As such, equating impaired performance in clinical and normal groups is not correct (another concept that was not operationalized correctly in the Meyer et al. study [10]). This is especially true when demographically corrected T-scores are used, because the level of impairment is then primarily a reflection of a disease effect in the clinical sample. Importantly, because an increasing number of RCTs addressing the prevention and treatment of HAND are likely to be conducted in low- and middle-income settings, funding for the establishment of normative data in these countries will be needed.