1. Introduction

Probably no other statistic has been reported more often as a quality indicator of test scores than Cronbach’s (1951) alpha coefficient, and presumably no other statistic has been subject to so much misunderstanding and confusion. Two problems concerning alpha continue to be pervasive in test construction and test use. The first problem is twofold: alpha is a lower bound to the reliability, in many cases even a gross underestimate, and alpha cannot attain a value that is a possible value of the reliability under the usual assumptions about measurement error. Better alternatives to alpha exist but are hardly known, let alone used to assess reliability. Thus, by continuing to use alpha as the estimate of reliability, test constructors and test users do themselves an injustice until they recognize the availability of better alternatives. The second problem is that alpha is persistently and incorrectly taken to be a measure of the internal structure of the test, and hence as evidence that the items in the test “measure the same thing.” However, alpha does not provide the researcher with this sort of information. The result of this misinterpretation is that, due to a high alpha value, trait validity (Campbell, 1960) often is taken for granted when, in fact, it has not been investigated at all.

Because alpha continues to be so important, in particular to practical researchers, and because alpha continues to be the subject of so much misinterpretation, there appears to be a strong need to settle some issues and provide suggestions for the practical estimation of test score reliability and the assessment of what the test measures. The goal of this paper is to illuminate the flaws and fallacies that surround both the “common” knowledge base and the practical use of Cronbach’s alpha, and to provide alternatives. This paper is also meant to invite debate on topics that psychometricians often seem to overlook and that test constructors and test practitioners tend to take for granted. The paper only uses knowledge that has been around for a while but that somehow has failed to catch on widely enough.

The paper is organized as follows. First, some historical facts about Cronbach’s alpha are discussed. Second, the definitions of test score reliability, the greatest lower bound (glb; e.g., Woodhouse & Jackson, 1977) to the reliability, and alpha are discussed, and the relationships between alpha, the glb, and the reliability are outlined. Third, the application to real data of alpha, the greater lower bound \(\lambda_2\) (Guttman, 1945), and the glb is discussed, and it is shown that in real data, too, both alpha and \(\lambda_2\) can be considerably smaller than the glb. Fourth, it is explained how alpha came to be misunderstood as a measure of internal consistency, and it is shown that, in general, alpha does not convey information on the internal structure of the test. Fifth, it is argued that reliability estimates based on a single test administration, like alpha, may not convey much information about the accuracy of individual test performance. This contribution ends with five conclusions about the usefulness of alpha and alternative reliability estimation procedures.

2. Historical Facts

As happens so often, great inventions do not carry the name of their inventor, but instead that of the researcher who was most successful in outlining their favorable properties in such a way that all of a sudden everything seemed to fall into place. It is no different with Cronbach’s alpha. To avoid misunderstandings, Cronbach (1951) himself did not claim alpha to be his invention but at great length credited results with respect to alpha to other authors. These authors include Kuder and Richardson (1937), who published a version of alpha for dichotomous items that went under the name of KR20. Another author is Hoyt (1941), who proposed a method for estimating reliability based on an analysis-of-variance decomposition of the data, which for dichotomous items gives the same results as KR20. Guttman (1945) derived alpha, denoted by the indexed Greek lower case \(\lambda_3\), as the third in a series of six coefficients, each of which was shown to be a lower bound to the reliability (also, see Jackson & Agunwamba, 1977). The derivation of alpha and the other coefficients used continuous random variables for the item scores, and thus includes dichotomous and ordered polytomous scoring as special cases. For dichotomously scored items, KR20, and to some extent a computationally convenient approximation denoted KR21—notice there were no computers in those days—actually gained quite some fame, but they were gradually pushed aside as alpha conquered territory.

Ever since its publication in 1951 in Psychometrika, Cronbach’s famous paper has been a landmark to psychometricians, test constructors and test practitioners. Today, the paper still is one of the most downloaded papers from Psychometrika’s website (accessible via http://www.springer.com). Web of Science reports over 6,500 citations, which is a crushing number compared to the already very respectable 400+ citations for Kuder and Richardson (1937) and 200+ citations for Guttman (1945). For what it is worth (perhaps biological articles become outdated quicker than psychometric articles), it even outranks Watson and Crick’s famous 1953 Nature article in which they describe the discovery of the double helix structure of DNA. Almost no psychological test or inventory is published without alpha being reported (usually without reference to Cronbach’s paper), often for each interesting subgroup separately. Also, alpha continues to receive interest in psychometric research. For example, Van Zyl, Neudecker, and Nel (2000), Kistner and Muller (2004), and Hayashi and Kamata (2005) in different ways addressed the distribution of alpha; Ten Berge and Sočan (2004) discussed the relationship between alpha and other lower bounds and test unidimensionality; and Zinbarg, Revelle, Yovel, and Li (2005) compared alpha with several other methods for estimating test score reliability. Each of these papers was published in Psychometrika, but other papers have appeared recently in other mainstream methodological journals (e.g., Raykov, 2001; Rodriguez & Maeda, 2006). Also, critical discussions of uses and abuses of alpha have appeared in substantive journals (e.g., Cortina, 1993; Schmitt, 1996).

3. Reliability, the Greatest Lower Bound, and Alpha

3.1. Test Score Reliability and the Greatest Lower Bound

The definition of reliability is based on parallel test forms (Novick, 1966; Novick & Lewis, 1967; also, see Lord & Novick, 1968). Let random variable \(X_j\) denote the score on item j; for example, \(X_j = 0, 1\) for the incorrect/correct scoring typical of performance tests, and \(X_j = 0, \ldots, m\) for the ordered rating scales typical of behavior assessment. The test contains J items. A much-used summary of the item scores is the total score or test score, which is defined as

$$X_+ = \sum_{j=1}^{J} X_j.$$

Let respondents be indexed by i, such that \(X_{+i}\) denotes respondent i’s total score.

Test score \(X_{+i}\) is assumed to suffer from random measurement error. Thus, rather than \(X_{+i}\), one would like to know respondent i’s test score without error. This error-free test score is defined operationally, which means, technically and without reference to a situation in real life, as the expectation of \(X_{+i}\) across the propensity distribution of independent repetitions of the test to individual i, that is, as \(\varepsilon(X_{+i})\) (Lord & Novick, 1968, pp. 29–30). The expectation is better known as the true score, \(T_i\), such that

$$T_i = \varepsilon(X_{+i}).$$

Because the true score is a real number, it could never be the result of adding integer item scores; thus, the + sign does not appear with T. In item response theory (IRT), the propensity distribution appears in the stochastic subject formulation of response behavior at the level of individual items (Holland, 1990).

The difference between a test score resulting from a single test administration and the true score is defined to be the random measurement error,

$$E_i = X_{+i} - T_i,$$

which is a real number and, like T, does not carry the + sign. Because measurement errors are assumed to originate unpredictably from a random process, they correlate 0 with any other variable unless they are part of that variable (such as \(X_{+i}\)). This zero correlation is also assumed to hold in a population of respondents for whom only one test score per person is available.

Parallel tests represent a mathematical definition of independent repetitions of the same test under the same circumstances. Two tests, with test scores \(X_+\) and \(X'_+\), are parallel if

$$(1)\quad T_i = T'_i \quad \text{for all } i,$$

and, denoting variance by \(\sigma^2\), if also

$$(2)\quad \sigma^2_{X_+} = \sigma^2_{X'_+}.$$

Thus, (1) an individual has the same long-run test performance on both tests, and (2) the variance of the test scores in the population is the same for both tests. It can easily be shown that these two properties imply that parallel tests have exactly the same psychometric properties. For example, the correlation of \(X_+\) and \(X'_+\) with any other independently measured variable Y is the same for both tests, implying equal validity. The only difference resides in the test scores themselves: that is, in general \(X_{+i} \neq X'_{+i}\), which is due to random measurement error.

The reliability of the test score \(X_+\) in the population of interest is defined as the product-moment correlation between the scores on \(X_+\) and the scores on a parallel test, with scores denoted by \(X'_+\); the reliability is denoted by \(\rho_{X_+X'_+}\). Because one test is parallel to the other, the correlation between the test scores gives the reliability of both \(X_+\) and \(X'_+\) separately. A well-known result is that \(0 \le \rho_{X_+X'_+} \le 1\). It can further be shown that \(\sigma^2_{X_+} = \sigma^2_T + \sigma^2_E\) (and, likewise, \(\sigma^2_{X'_+} = \sigma^2_{T'} + \sigma^2_{E'}\)), and that consequently for test score \(X_+\),

$$(1)\quad \rho_{X_+X'_+} = \frac{\sigma^2_T}{\sigma^2_{X_+}} = 1 - \frac{\sigma^2_E}{\sigma^2_{X_+}}.$$

Thus, three interchangeable ways of saying that the reliability is higher are: parallel test forms correlate higher, true-score variance is greater relative to test-score variance, and error variance is smaller relative to test-score variance.
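To make (1) concrete, the following small simulation may help; it is an illustration added to this discussion, not part of the original text, and the variance values are arbitrary assumptions. It generates scores on two parallel test forms and shows that their correlation recovers the ratio of true-score variance to test-score variance.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
n = 100_000                      # respondents
var_t, var_e = 4.0, 1.0          # assumed true-score and error variance

t = rng.normal(0.0, np.sqrt(var_t), n)            # true scores T_i
x = t + rng.normal(0.0, np.sqrt(var_e), n)        # test score X_+
x_prime = t + rng.normal(0.0, np.sqrt(var_e), n)  # parallel form X'_+

print(np.corrcoef(x, x_prime)[0, 1])  # approx. 0.80: the reliability
print(var_t / (var_t + var_e))        # sigma_T^2 / sigma_X^2 = 0.80
```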

Equation (1) shows that the reliability can be estimated if two parallel versions of the test are available or if the true-score variance (or, equivalently, the error variance) is available on the basis of one test administration. Because these possibilities are unattainable in practical test research, many alternatives have been proposed (e.g., Guttman, 1945; Nunnally, 1978) that use the data available from a single test administration. The most instructive method is the glb (e.g., Bentler & Woodward, 1980; Jackson & Agunwamba, 1977; Woodhouse & Jackson, 1977). Ten Berge and Sočan (2004) explain the glb as follows. The interitem covariance matrix for observed item scores, \(\mathrm{C}_X\), is decomposed into the sum of the interitem covariance matrix for item true scores, \(\mathrm{C}_T\), and the interitem error covariance matrix \(\mathrm{C}_E\): \(\mathrm{C}_X = \mathrm{C}_T + \mathrm{C}_E\). The matrix \(\mathrm{C}_E\) is diagonal, with the error variances on the main diagonal and off-diagonal zeroes reflecting that errors correlate zero with any variable in which they are not included. All three matrices are positive semidefinite (psd; i.e., they do not have negative eigenvalues). The glb problem is solved by finding the nonnegative diagonal matrix \(\mathrm{C}_E\) for which \(\mathrm{C}_T = \mathrm{C}_X - \mathrm{C}_E\) is psd and which minimizes

$$r_{X_+X'_+} = 1 - \frac{\mathrm{tr}(\mathrm{C}_E)}{S^2_{X_+}}.$$

This is the glb because it represents the smallest reliability possible given the observable covariance matrix \(\mathrm{C}_X\), under the restriction that the sum of the error variances is maximized for errors that correlate 0 with other variables. Thus, the data obtained from one test administration restrict the true reliability to the interval [glb, 1]. This means that when the glb is found to be 0.8, the true reliability has a value in the interval [0.8, 1]. Thus, data from a single test administration restrict the reliability to an interval, whereas data from two parallel tests would yield a point estimate of the reliability. Algorithms for solving the glb problem are discussed by Bentler and Woodward (1980) and Ten Berge, Snijders, and Zegers (1981).
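As an illustration of this optimization problem, the following sketch casts the glb as a small semidefinite program. It is an addition to the text and assumes the Python packages numpy and cvxpy (with an SDP-capable solver such as the bundled SCS); the computations reported in this paper used MRFA2.exe instead.

```python
import numpy as np
import cvxpy as cp

def glb(cov_x: np.ndarray) -> float:
    """Greatest lower bound to the reliability: maximize the total error
    variance tr(C_E) while keeping C_T = C_X - C_E positive semidefinite."""
    j = cov_x.shape[0]
    err_var = cp.Variable(j, nonneg=True)            # diagonal of C_E
    problem = cp.Problem(cp.Maximize(cp.sum(err_var)),
                         [cov_x - cp.diag(err_var) >> 0])
    problem.solve()
    return 1.0 - err_var.value.sum() / cov_x.sum()   # cov_x.sum() = var(X_+)
```

Applied to the covariance matrix shown in (4) in Section 3.3 below, this sketch returns 1 up to solver tolerance.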

3.2. Definition of Alpha

Let \(\sigma^2_j\) denote the variance of item score \(X_j\) and \(\sigma_{jk}\) the covariance between item scores \(X_j\) and \(X_k\). Alpha is defined as

$$alpha = \frac{J}{J-1}\left[1 - \frac{\sum_{j=1}^{J}\sigma_j^2}{\sigma_{X_+}^2}\right],$$

or equivalently as

$$(2)\quad alpha = \frac{J}{J-1}\,\frac{\sum\sum_{j \ne k}\sigma_{jk}}{\sigma_{X_+}^2}.$$

This latter form proves to be useful later on. It may be noted that alpha ≤ 1, with alpha < 0 if the mean interitem covariance among the J items is negative. This is known to happen sometimes when positively and negatively worded personality or attitude items are accidentally coded in the same direction.
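In code, the definition amounts to a few lines. The following sketch, an illustration added here and assuming numpy, computes alpha from an interitem covariance matrix using the first form of the definition.

```python
import numpy as np

def cronbach_alpha(cov_x: np.ndarray) -> float:
    """Coefficient alpha from an interitem covariance matrix: tr(C_X) is
    the sum of the item variances, and the sum of all entries of C_X is
    the variance of the total score X_+."""
    j = cov_x.shape[0]
    return (j / (j - 1)) * (1.0 - np.trace(cov_x) / cov_x.sum())
```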

3.3. Relationship Between Alpha, the glb, and Reliability

Guttman (1945, p. 274) proved that alpha—his \(\lambda_3\) coefficient—is a lower bound to the reliability; that is, he proved that for J items,

$$alpha \le \rho_{X_+X'_+}.$$

Novick and Lewis (1967, Theorem 3.1) proved that \(alpha = \rho_{X_+X'_+}\) holds if and only if the items in the test are essentially τ-equivalent (τ is sometimes used to denote the true score; i.e., τ = T). Essential τ-equivalence is another mathematical definition of the similarity of different tests (here, items are considered as 1-item tests) that is less restrictive than parallelism. For items j and k, and a constant \(a_{jk}\), essential τ-equivalence is defined as

$$T_j = T_k + a_{jk}, \quad \text{for all item pairs } j \ne k.$$

Essential τ-equivalence implies that the interitem covariance \(\sigma_{jk}\) is the same for all item pairs (j, k), and that the covariance \(\sigma_{jY}\) is the same for all items (j = 1, …, J) and any independently measured variable Y. Like parallelism, essential τ-equivalence is not a realistic condition in test data, so that in real data we have that \(alpha < \rho_{X_+X'_+}\).
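The equality can be checked numerically. The following sketch, an added illustration with assumed values, constructs an essentially τ-equivalent item set: every interitem covariance equals the true-score variance, and alpha then reproduces the reliability exactly, even with unequal error variances.

```python
import numpy as np

j, var_t = 5, 1.0
err_var = np.array([0.5, 0.8, 1.0, 1.2, 1.5])     # unequal error variances

# X_j = T + a_j + E_j: all interitem covariances equal var(T)
cov_x = np.full((j, j), var_t) + np.diag(err_var)

alpha = (j / (j - 1)) * (1.0 - np.trace(cov_x) / cov_x.sum())
reliability = j**2 * var_t / cov_x.sum()          # var(T_+) / var(X_+)
print(alpha, reliability)                         # both equal 0.8333...
```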

The glb relates to alpha and the reliability as

$$(3)\quad alpha \le \mathrm{glb} \le \rho_{X_+X'_+}.$$

Equation (3) is true because \(alpha \le \mathrm{glb}\) (Jackson & Agunwamba, 1977), and by definition \(\mathrm{glb} \le \rho_{X_+X'_+}\). We know that \(alpha = \rho_{X_+X'_+}\) if and only if the items are essentially τ-equivalent. Also, \(\mathrm{glb} = \rho_{X_+X'_+}\) if the items are essentially τ-equivalent, but equality can also be obtained under other conditions. For example (Ten Berge, personal communication), one may use the covariance matrix

$$(4)\quad \mathrm{C}_X = \begin{bmatrix} 1 & 1 & 2 \\ 1 & 2 & 3 \\ 2 & 3 & 5 \end{bmatrix}$$

and (7b) from Ten Berge and Sočan (2004) to verify that glb = 1. This result implies that \(\mathrm{glb} = \rho_{X_+X'_+}\) even though the covariances in (4) are unequal, which violates essential τ-equivalence. Because glb = 1 implies that the sum of the error variances is 0, it follows that \(\mathrm{C}_E = 0\) and \(\mathrm{C}_X = \mathrm{C}_T\) (also, see Ten Berge & Sočan, 2004, (5), first part). Also, notice that in this example alpha = 0.9.
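The claims in this example are easy to check numerically. The following sketch, an added illustration, verifies that \(\mathrm{C}_X\) in (4) is psd and singular, so that no error variance can be subtracted from its diagonal, and that alpha equals 0.9.

```python
import numpy as np

c_x = np.array([[1., 1., 2.],
                [1., 2., 3.],
                [2., 3., 5.]])

print(np.linalg.eigvalsh(c_x))      # all eigenvalues >= 0, one equals 0
# The null vector (1, 1, -1) has no zero entries, so subtracting any
# nonzero nonnegative diagonal C_E would make C_X - C_E indefinite:
# hence C_E = 0 and glb = 1 - 0 / c_x.sum() = 1.
print((3 / 2) * (1 - np.trace(c_x) / c_x.sum()))  # alpha = 0.9
```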

Equation (3) shows that for real data alpha is not in the interval [glb, 1] of admissible values, and the conclusion can only be that for any observable covariance matrix \(\mathrm{C}_X\), alpha provides a value that cannot be a possible value of the reliability given the knowledge provided by one test administration. One could argue that it does not hurt to use a small lower bound like alpha in practice, because unnecessarily low reliability estimates may have the positive effect of stimulating the researcher to do all (s)he can to construct a high-quality test. Much as this is true, by the same line of reasoning one should then accept an even smaller lower bound, such as Guttman’s (1945) λ1 coefficient (λ1 < alpha for finite test length), because that would boost the effect even more. Perhaps it is more reasonable to ask why one would report such an estimate of the reliability in the face of much better alternatives, the most prominent being the glb. Moreover, Schmitt (1996) warns that lower bounds like alpha may produce gross overestimates of the correlation between test scores when they are corrected for attenuation. The real-data example reported in the next subsection shows that using the glb instead of alpha or another lower bound can indeed make a difference.

3.4. Alternatives for Alpha

Borsboom (2006) noted that the degree to which a statistical method is used in empirical research depends very much on its availability in SPSS. Alpha is in SPSS, and so are the other five lower bounds proposed by Guttman (1945). One of them is known under the name of \(\lambda_2\) and is sometimes reported instead of alpha. Guttman (1945) proved that \(alpha \le \lambda_2\). Alpha and \(\lambda_2\) are the first two terms of an infinite series of lower bounds, in which they are denoted \(\mu_0\) and \(\mu_1\), respectively (Ten Berge & Zegers, 1978). Ten Berge and Zegers (1978) concluded that computing lower bounds from their series beyond \(\lambda_2\) usually does not produce increases that are worthwhile reporting.
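For completeness, here is a sketch of \(\lambda_2\), an added illustration assuming numpy, with the formula following Guttman (1945): \(\lambda_2\) augments \(\lambda_1\) with a term based on the squared interitem covariances.

```python
import numpy as np

def guttman_lambda2(cov_x: np.ndarray) -> float:
    """Guttman's lambda_2: lambda_1 plus a correction based on the
    squared off-diagonal covariances (Guttman, 1945)."""
    j = cov_x.shape[0]
    total_var = cov_x.sum()                        # variance of X_+
    off_diag = cov_x - np.diag(np.diag(cov_x))     # covariances only
    lambda1 = 1.0 - np.trace(cov_x) / total_var
    return lambda1 + np.sqrt(j / (j - 1) * (off_diag**2).sum()) / total_var
```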

Coefficient \(\lambda_2\) relates to alpha, the glb, and the reliability as

$$(5)\quad alpha \le \lambda_2 \le \mathrm{glb} \le \rho_{X_+X'_+}.$$

In (5), we have that \(\lambda_2 \le \mathrm{glb}\) (Jackson & Agunwamba, 1977). Except for the relationship between the glb and \(\rho_{X_+X'_+}\), (5) contains equalities if and only if the items are essentially τ-equivalent. Equality between the glb and \(\rho_{X_+X'_+}\) can also be obtained under different conditions; see (4) for an example. Information about the sampling characteristics of lower bounds is available from several sources (e.g., Feldt, Woodruff, & Salih, 1987). The glb estimate may be positively biased even for samples as large as 1,000 cases, but the bias seems to be rather small when the number of items is smaller than 10 (Ten Berge & Sočan, 2004).

The three lower bounds alpha, \(\lambda_2\), and the glb were computed for a real-data example. The data came from a questionnaire consisting of eight rating-scale items, scored 0, 1, 2, 3, administered to 828 respondents. Each item asked respondents who lived in the vicinity of a malodorous factory how they coped with industrial malodors (Cavalini, 1992). The dimensionality of the data was investigated using principal components analysis (PCA); researchers typically use PCA for this purpose, although better methods are available, to be discussed later on. Alpha, \(\lambda_2\), and PCA with Varimax rotation were computed by means of SPSS 14.0 (2006), and the glb was computed by means of the program MRFA2.exe (Ten Berge & Kiers, 2003), which can be downloaded from http://www.ppsw.rug.nl/~kiers/.

PCA of the interitem correlation matrix \(\mathbf{R}_X\) resulted in eigenvalues of 3.213 and 1.103, with the next six each smaller than 1. The second component showed the contrasts that typically suggest a rotated 2-factor solution would better explain the correlation structure, despite only one eigenvalue being markedly greater than 1. Indeed, Varimax rotation of the first two components resulted in one set of three items with loadings on the first factor greater than 0.7, and another set of four items with loadings on the second factor greater than 0.6 (Table 1). One item had a loading of approximately 0.5 on both factors, but based on content it went better with the items loading highest on the first factor.

Table 1 Factor loadings for eight items measuring coping styles.

For all eight items considered to be in one scale, alpha is 0.007 smaller than \(\lambda_2\), but \(\lambda_2\) is 0.067 smaller than the glb (Table 2). The lower bound values for the first of the two 4-item scales resulting from the PCA were nearly as high as those found for the 8-item scale, but the lower bounds for the second 4-item set were smaller by approximately 0.15. Still, \(\lambda_2\) was 0.074 smaller than the glb in the first scale, and 0.052 smaller than the glb in the second scale. The gap between alpha/\(\lambda_2\) and the glb was caused by the spread in the interitem covariances (Table 3). This violation of a necessary condition for essential τ-equivalence prevented the three lower bounds from being equal.

Table 2 Lower bounds alpha, λ2, and the glb, Total Observed Variance (TotObsVar), Total Common Variance (TotComVar), and Explained Common Variance (ECV) for an 8-item scale and two 4-item scales.
Table 3 Covariance matrix for the eight items. Items 5, 7, 9, and 11 are in Set 1; items 3, 6, 13, and 14 are in Set 2. Interitem covariances within these sets are in bold face.

The differences reported here are believed to be of practical interest to test constructors and researchers who report a reliability estimate for their test or questionnaire. Moreover, in this data set there is no convincing reason to report unnecessarily small reliability estimates.

4. Alpha as Measure of Internal Consistency

4.1. Drifting Away From Reliability to Internal Consistency

When Cronbach published his classical article in 1951, it was already known that alpha was a lower bound to the reliability, but it is important to realize that at that time several definitions of test score reliability, true score, and random measurement error existed next to one another. The widely accepted foundation of classical test theory as provided later on by Novick (1966) and Lord and Novick (1968) was unknown then. Thus, Cronbach (1951, p. 299) could write: “It has generally been stated that α (i.e., Cronbach’s alpha; the author) gives a lower bound to the “true reliability”—whatever that means to that particular writer.” As a result, the concept of a lower bound did not seem as compelling to Cronbach as it is nowadays and, instead, much of Cronbach’s paper was not about alpha as a lower bound but about analyzing the relationships of alpha with correlations between similar test forms (“similar” is different here from parallel), test-retest correlation, and split-half correlation, and with the factorial composition of the test. This produced several interesting results that were picked up by many psychologists and led to the interpretation of alpha as a measure of the internal consistency of a test. It is safe to say that the interpretation of alpha as a measure of internal consistency has gained more foothold in practical test construction and test use than the lower bound interpretation. Before I try to explain this preference, I first ask what internal consistency is.

Schmitt (1996) distinguishes internal consistency from homogeneity, and claims that internal consistency refers to the interrelatedness of a set of items, and homogeneity to the unidimensionality of a set of items. However, this distinction does not convincingly solve terminological confusion. To start with, unidimensionality is not a unitary concept. The concept plays a role both in factor analysis and IRT, and has been defined in different ways. There are similarities, however. Lord and Novick (1968, p. 374, Theorem 16.8.1) proved that if J dichotomous items originate from different dichotomizations of J normal distributions of latent continuous item scores, which have a rank 1 covariance matrix, then the regression of each item on the latent trait is a 2-parameter normal ogive. Takane and De Leeuw (1987) studied the relationship between the factor model and normal ogive IRT models in a more general framework. Independent of factor models, within the class of different unidimensional logistic IRT models such as the 1-, 2-, and 3-parameter models, each model imposes different restrictions on the data, and each model may be seen as representing another definition of unidimensionality. Thus, it seems that in general the concept of unidimensionality is tied to a particular model and in this sense it is clear what unidimensionality means under that particular model.

Internal consistency has not been defined that explicitly; far from it. For example, Cronbach (1951, p. 320) used internal consistency and homogeneity synonymously (cf. Schmitt, 1996), and noted that an internally consistent test is “psychologically interpretable,” although this does not mean “that all items be factorially similar.” In the jargon of test construction, internal consistency often refers to the items “being interrelated” (Schmitt, 1996), but other interpretations are also used regularly. In practical test construction, the use of alpha often goes hand-in-hand with PCA (e.g., Cavalini, 1992; De Hooge, Zeelenberg, & Bruegelmans, 2007). A pervasive albeit informal interpretation of a test’s internal consistency is that the first eigenvalue of the interitem correlation matrix is high relative to the second eigenvalue, but exactly how high it should be remains unclear. This interpretation is indeed different from equating internal consistency with a 1-factor solution or with IRT unidimensionality, and it leaves open the possibility that different items have varying patterns of factor loadings if more than one factor is retained. This comes close to Cronbach’s remark that items need not be factorially similar for the test to be internally consistent. But what this analysis does best is underline the vagueness of the internal consistency concept.

This vagueness has not stopped alpha from becoming a landmark for internal consistency. Remarkable, however, is that a glance at alpha shows that, all other things kept equal, its value depends only on the sum of the interitem covariances (see (2)). Thus, all that alpha can reveal about the “interrelatedness of the items” is their average degree of “interrelatedness,” provided there are no negative covariances, and keeping in mind that alpha also depends on the number of items in the test (Nunnally, 1978, pp. 227–228). Because this says very little if anything about internal consistency, no matter how it is defined, one wonders why the internal consistency interpretation of alpha is so persistent. I believe that there are two related reasons.

The first reason is that while several studies have well illuminated the relationships of alpha to other quantities (e.g., Cortina, 1993; Green, Lissitz, & Mulaik, 1977; also see Cronbach, 1988), in particular the factor structure of the test, they have also conveyed the impression that because alpha has something to do with the test’s factor structure, its value must therefore express characteristics of this factor structure. This conclusion is logically incorrect and usually not intended by these studies, but it probably has been too compelling for many test constructors and test practitioners to resist. A single number—alpha—that expresses both reliability and internal consistency—conceived of as an aspect of validity suggesting that the items “measure the same thing”—is a blessing for the assessment of test quality. Meanwhile, alpha is “only” a lower bound to the reliability, and not even a realistic one.

The second reason is that after the 1950s psychometrics developed to become more mathematically and statistically oriented, while psychologists primarily remained psychologists. One can argue about whether psychologists should become better statisticians or whether psychometricians should become better psychologists (Borsboom, 2006), but it is a fact that the two worlds have drifted apart more than anyone should wish. Thus, while much of Cronbach’s paper was and still is accessible to many psychologists, the work by Lord, Novick, and Lewis and many others since may have gone unnoticed by most psychologists. This is truly an example of the gap that has grown between psychometrics and psychology and that prevents new and interesting psychometric results, including those that relate alpha to the glb and the test’s factor structure, from seeping into mainstream psychology.

4.2. Alpha and Internal Test Structure

There is no clear and unambiguous relationship between alpha and the internal structure of a test. This can be demonstrated in a simple way. First, it is shown that a 1-factor test may have any alpha value. Thus, it may be concluded that the value of alpha says very little if anything about unidimensionality. Second, it is shown that different tests of varying factorial composition may have the same alpha value. Thus, it may be concluded that alpha says very little if anything about multiple-factor item structures.

Alpha and Unidimensionality. Equal item variances, equal interitem covariances, and, consequently, equal interitem correlations are necessary (but not sufficient) for parallel items. Ten Berge and Kiers (1991) advocated the use of minimum rank factor analysis (MRFA) for assessing the closeness of the covariance/correlation matrix to unidimensionality. For a 1-factor solution, MRFA determines the diagonal matrix \(\mathrm{C}_E\) of unique variances that produces the smallest sum of the J − 1 smallest eigenvalues of the difference matrix \(\mathrm{C}_X - \mathrm{C}_E\). Thus, the amount of common variance that is left unexplained when the last J − 1 factors are ignored is minimized, and as a result the 1-factor solution is the “most-unidimensional” factor solution.

Closeness of the 1-factor solution to unidimensionality is assessed by means of the ratio of the first eigenvalue of \(\mathrm{C}_X - \mathrm{C}_E\) to the sum of all J eigenvalues of \(\mathrm{C}_X - \mathrm{C}_E\) (Ten Berge & Sočan, 2004). After transforming this ratio to a percentage, the explained common variance (ECV) is obtained. Instead of MRFA and the ECV, test constructors often use PCA and the percentage of observed variance (POV) corresponding to the first eigenvalue extracted by PCA from the correlation matrix \(\mathbf{R}_X\). It should be noted that PCA is based on \(\mathrm{C}_E = 0\), and thus provides the “least-unidimensional” factor solution in terms of the eigenvalues corresponding to \(\mathrm{C}_X - \mathrm{C}_E\).

A 1-factor item structure was operationalized in each of seven tests, each consisting of six items (J = 6) with item variances all equal to \(\sigma^2_j = 0.25\) (j = 1, …, 6) and equal positive interitem covariances \(\sigma_{jk}\). Across tests, the covariances varied from high from a practical point of view (\(\sigma_{jk} = 0.15\), corresponding to product-moment correlation \(\rho_{jk} = 0.6\)) to low (\(\sigma_{jk} = 0.01\), corresponding to \(\rho_{jk} = 0.04\)). For each of the seven interitem correlation matrices, MRFA was done by means of the program MRFA2.exe (Ten Berge & Kiers, 2003), and PCA was done by means of SPSS 14.0 (2006) syntax code. MRFA2.exe also produces the glb. Alpha was computed by means of SPSS 14.0 syntax code and was compared to the glb.

ECV is 100% for all seven interitem correlation matrices, but POV, which is the quantity most test constructors use for assessing unidimensionality, starts at 66.67% and then drops gradually to 20% (Table 4). Thus, ECV indicates perfect unidimensionality, whereas POV suggests factor solutions that move away from unidimensionality. This conclusion is amplified if one also takes the eigenvalues of the correlation matrix \(\mathbf{R}_X\) (PCA) into consideration. Many researchers would probably take the last two or three sets of eigenvalues (Table 4) as evidence of multidimensionality (i.e., the items each correspond to unique factors) instead of unidimensionality. Because the covariance matrices conform to essential τ-equivalence, it follows that alpha = glb. Table 4 also shows that as the interitem covariance drops while everything else is kept constant, alpha and the glb drop from 0.90 to 0.20, that is, from high to low.

Table 4 Eigenvalues (EV) of the observable correlation matrix \(\mathbf{R}_X\), percentage of observed variance (POV) explained by the first principal component, ECV, alpha, and glb, for tests with J = 6, \(\sigma^2_j = 0.25\) (j = 1, …, 6), and \(\sigma_{jk}\) constant per test and variable across tests.

The seven examples in Table 4 each represent a case of unidimensionality: From left to right, the signal becomes weaker while the noise (due to unique factors and measurement error) becomes stronger. But all the time there is one signal—unidimensionality—correctly identified by ECV = 100%. The reliability quantifies the degree to which test scores can be repeated under the same circumstances. As the signal in the data becomes weaker, alpha and the glb become smaller, as they should.
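The pattern of Table 4 can be reproduced numerically. In the sketch below, an added illustration in which the text specifies only the endpoint covariances 0.15 and 0.01 so that the intermediate values are arbitrary assumptions, POV follows directly from the eigenvalues of \(\mathbf{R}_X\), while ECV is 100% by construction: for equal covariances c, MRFA can take \(\mathrm{C}_E = (0.25 - c)\,\mathbf{I}\), leaving a rank-1 matrix \(\mathrm{C}_T\).

```python
import numpy as np

j, var = 6, 0.25
for c in (0.15, 0.10, 0.05, 0.03, 0.02, 0.015, 0.01):
    cov_x = np.full((j, j), c) + np.diag([var - c] * j)  # compound symmetry
    corr_x = cov_x / var                   # R_X: all item variances equal
    pov = 100 * np.linalg.eigvalsh(corr_x)[-1] / j       # first EV over J
    alpha = (j / (j - 1)) * (1 - np.trace(cov_x) / cov_x.sum())
    print(f"cov = {c:.3f}   POV = {pov:6.2f}%   alpha = glb = {alpha:.2f}")
# endpoints: POV = 66.67% and alpha = 0.90 at c = 0.15,
#            POV = 20.00% and alpha = 0.20 at c = 0.01
```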

Alpha and Multidimensionality. Multidimensionality was operationalized by means of three tests, again each consisting of six items (J = 6), with item variances equal to \(\sigma^2_j = 0.25\) (j = 1, …, 6), and interitem covariances \(\sigma_{jk}\) such that: (1) they were positive and equal within clusters of items; (2) they were zero between items from different clusters; and (3) the sum of all J(J − 1) covariances was constant across the different matrices \(\mathrm{C}_X\). Condition 3 implies the same alpha for each covariance matrix. Table 5 shows the lower triangles of the covariance matrices \(\mathrm{C}_X\) with three 2-item clusters, two 3-item clusters, and one 6-item cluster, respectively.

Table 5 Covariance matrices \(\mathrm{C}_X\), EVs based on the corresponding correlation matrix \(\mathbf{R}_X\), ECV, glb, and alpha.

The first two sets of eigenvalues from \(\mathbf{R}_X\) each suggest the correct dimensionality of the tests, while the ECV shows that \(\mathbf{R}_X\) is remote from unidimensionality. The third set of eigenvalues would probably lead several researchers to conclude that there is one common albeit weak factor, but here the ECV suggests perfect unidimensionality. Coefficient alpha equals 0.533 for all three covariance matrices, irrespective of dimensionality. Interestingly, the glb is highest for the 3-factor case and lowest for the 1-factor case (in the latter case, the glb coincides with alpha because \(\mathrm{C}_X\) satisfies a necessary condition for essential τ-equivalence; also, see Table 4). More importantly, alpha does not provide information on the internal structure of the test, as is so often claimed.

Going back to the real-data example discussed previously, it is interesting to see (Table 2) that ECV for the 8-item scale suggests that the scale is remote from unidimensionality. Both 4-item scales have high ECV values suggesting near-unidimensionality, but once more it is clear that unidimensionality or lack thereof has nothing to do with reliability.

Moreover, alpha depends on the number of items J, and our examples can easily be adapted to show that alpha grows as J grows (Cortina, 1993; Green et al., 1977); see the sketch after this paragraph. For example, for J = 12, \(\sigma^2_j = 0.25\) (j = 1, …, 12), and covariance structures with three 4-item clusters (\(\sigma_{jk} = 0.20\) within clusters), two 6-item clusters (\(\sigma_{jk} = 0.12\)), and one 12-item cluster (\(\sigma_{jk} = 0.0545454\)), such that each time \(\sum\sum_{j \ne k} \sigma_{jk} = 7.2\), alpha = 0.770 in all three cases.
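The following sketch reproduces both sets of results; it is an added illustration, and the within-cluster covariances for J = 6 (0.20, 0.10, and 0.04) are reconstructed here so that the covariance sum stays at 1.2 and alpha = 0.533, since Table 5’s exact entries are not repeated in the text, whereas the J = 12 covariances are the ones given above.

```python
import numpy as np
from scipy.linalg import block_diag

def clustered_cov(j: int, size: int, c: float, var: float = 0.25) -> np.ndarray:
    """j items in j/size clusters: covariance c within clusters, 0 between."""
    block = np.full((size, size), c) + np.diag([var - c] * size)
    return block_diag(*([block] * (j // size)))

def alpha(cov_x: np.ndarray) -> float:
    j = cov_x.shape[0]
    return (j / (j - 1)) * (1 - np.trace(cov_x) / cov_x.sum())

for size, c in [(2, 0.20), (3, 0.10), (6, 0.04)]:        # J = 6, cf. Table 5
    print(round(alpha(clustered_cov(6, size, c)), 3))    # 0.533 each time
for size, c in [(4, 0.20), (6, 0.12), (12, 0.0545454)]:  # J = 12 example
    print(round(alpha(clustered_cov(12, size, c)), 3))   # 0.770 each time
```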

5. Is There a Future for Alpha?

Lord and Novick (1968) discussed reliability as the repeatability of individual test performance, described by the individual’s propensity distribution. The propensity distribution shows the influence of random measurement error across an infinite number of parallel test administrations. However, due to the practical impossibility of administering the same test to the same individuals repeatedly—even twice is nearly impossible—one has to resort to a random sample of individuals who have been administered the test once, and then estimate the reliability on the basis of this single administration. The glb shows that such data limit the range of possible reliability values to [glb, 1], but also that a perfect reliability cannot be ruled out on the basis of one test administration. An interesting question is whether single-administration test data can provide information about individuals’ propensity distributions at all.

Molenaar (2004; also, see Borsboom, 2005, pp. 68–81) noted that in general a single-administration sample of test scores does not contain information about the individuals’ propensity distributions unless both types of distributions—between individuals as in single-administration data and within individuals as in propensity distributions—obey restrictive distributional properties. He contended that most psychological phenomena do not agree with these assumptions. Other authors have also noticed that statements about individuals are problematic when only single-administration data are available. For example, Ellis and Van den Wollenberg (1993) showed that IRT models do not hold for individuals unless the assumption of local homogeneity is added to the models. Molenaar (2004) reported that a (Big) 5-factor personality structure found at the group level, on the basis of a sample of observations collected at one point in time, did not correspond to the different factorial structures characteristic of different individuals who were repeatedly tested by means of the same personality inventory. This result seems to be related to the phenomenon that particular individuals are insensitive to certain personality traits, which has become known as lack of traitedness (Tellegen, 1988). Lack of traitedness may be the cause of atypical patterns of scores on items from personality inventories (Reise & Waller, 1993).

Likewise, there is no reason whatsoever to assume that the propensity distributions of different persons must be identical to one another and to the between-persons distribution based on single-administration data. This means that single-administration test data may contain little or no information about propensity distributions. The use of the standard measurement error,

$$\sigma_E = \sigma_{X_+}\sqrt{1 - \rho_{X_+X'_+}},$$

in the practice of psychological testing was born out of this inherent limitation of single-administration test data. The application of the standard measurement error assumes that each individual was tested with the same accuracy, but classical test theory does not make this assumption, nor is there much reason to expect a priori that people would produce the same propensity distributions when given the opportunity. Indeed, Lord (1960) studied distributions of measurement errors that varied across the true-score level, and IRT uses the Fisher information function to estimate a standard error that depends on the scale of measurement. Such improvements recognize the improbability of the same accuracy of measurement for every tested individual, but they cannot be considered realistic as long as their assumptions have not been put to the test in real data. That is, one needs to study real propensity distributions to find out how standard errors are related to the scale of measurement, and until then the results provided by Lord and by IRT are properties of statistical models, not of real behavior.
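As a brief worked illustration of this formula (the numbers are chosen here for illustration and do not come from the original text): a test with \(\sigma_{X_+} = 5\) and reliability 0.84 yields \(\sigma_E = 5\sqrt{1 - 0.84} = 2\), and in practice this single number is attached to every respondent’s test score, irrespective of his or her true-score level.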

The problem with discussions like this one is that, while (I believe) they make a good point, the practical test user needs to make decisions about the treatment of individual clients or patients and cannot afford to sit back and wait until science comes up with the final solution. Thus, it seems best to end with a number of conclusions about alpha and reliability, and to find out what the next best thing is for alpha and reliability in the absence of available propensity distributions.

6. Conclusions

On the basis of the previous discussion, the following five conclusions seem to be in order:

  1.

    In practice, alpha attains values that are outside the range of possible values of the reliability that can be derived from a single test administration. Comparing alpha with the glb gives an impression of the degree to which alpha is off. The difference can easily amount to a few tenths, depending on the exact properties of the test under consideration.

  2.

    Many lower bounds exist between alpha and the glb, and the lower bounds proposed by Guttman (1945) are all in SPSS, thus eliminating the “not in SPSS” argument often heard in practice. It is difficult to convincingly defend using one of the smallest lower bounds, alpha, given the availability of many greater lower bounds and the glb. The only reason to report alpha is that top journals tend to accept articles that use statistical methods, such as alpha, that have been around for a long time. Reporting alpha in addition to a greater lower bound may be a good strategy to introduce and promote a better reliability estimation practice.

  3.

    The best lower bound, and the only one attaining a realistic value, however, is the glb. The glb is available from several sources and easy to obtain (Ten Berge & Sočan, 2004). Because the glb can be seriously positively biased for lower reliability values, samples smaller than, say, 1,000 cases, and test lengths exceeding, say, 10 items, more work on bias correction is badly needed (e.g., Shapiro & Ten Berge, 2000; Verhelst, 1998), and psychometrics might spend more energy on this just cause. Once a good bias correction is found, there will be no way around the glb as the replacement of alpha (and all other lower bounds).

  4.

    Alpha is not a measure of internal consistency. Neither is it a measure of the degree of unidimensionality (also, see Ten Berge & Sočan, 2004). Alpha has been shown to correlate with many other statistics and, much as these results are interesting, they are also confusing in the sense that, without additional information, both very low and very high alpha values can go with either unidimensionality or multidimensionality of the data. But given that one needs this additional information to know what alpha stands for, alpha itself cannot be interpreted as a measure of internal consistency.

  5.

    Statistical results based on a single test administration convey little if any information about individuals’ measurement accuracy as reflected by their propensity distributions. This does not seem to be an insurmountable problem when a test is used for comparing mean scores between different groups or correlations between variables in a nomological network, but even then one has to be aware that “averaging out” the individual causes the means and correlations to lose their psychological meaning (Borsboom, 2005). For drawing conclusions about individuals on the basis of test scores, the best one can do is to use tests that consist of many items and have a reliability—be it estimated by means of Cronbach’s alpha—that approaches 1. More generally, it is recommended to use as much information about the individual as possible (e.g., Emons, Sijtsma, & Meijer, 2007).