Coefficients alpha, beta, omega, and the glb: Comments on Sijtsma


Psychometrika, Vol. 74, No. 1, pp. 145–154, March 2009.


WILLIAM REVELLE

DEPARTMENT OF PSYCHOLOGY, NORTHWESTERN UNIVERSITY

RICHARD E. ZINBARG

DEPARTMENT OF PSYCHOLOGY, THE FAMILY INSTITUTE AT NORTHWESTERN
UNIVERSITY, NORTHWESTERN UNIVERSITY

There are three fundamental problems in Sijtsma (Psychometrika, 2008): (1) contrary to the name, the glb is not the greatest lower bound of reliability but rather is systematically less than ωt (McDonald, Test theory: A unified treatment, Erlbaum, Hillsdale, 1999), (2) we agree with Sijtsma that when considering how well a test measures one concept, α is not appropriate, but recommend ωt rather than the glb, and (3) the end user needs procedures that are readily available in open source software.

Key words: reliability, internal consistency, homogeneity, test theory, coefficient alpha, coefficient omega, coefficient beta.

The problem of how to assess reliability has been with us ever since Spearman (1904) introduced the concept of correction for attenuation and that of split half reliability (Spearman, 1910). To Spearman (1904), reliability was used as a way of finding the “real correlation between the true objective values” (rpq) by correcting observed correlations (rp′q′) for the attenuation of “accidental” deviations of observed scores from their “true objective values.” To Spearman (1904, p. 90), this required finding “the average correlation between one and another of these independently obtained series of values” (what has come to be called parallel tests) to estimate the reliability of each set of measures (rp′p′, rq′q′), and then to find

rpq = rp′q′ / √(rp′p′ rq′q′).    (1)
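Spearman's correction divides the observed correlation by the geometric mean of the two reliabilities, rpq = rp′q′/√(rp′p′ rq′q′). A minimal numerical sketch (the function name and the correlation values here are ours, purely for illustration):

```python
import math

def disattenuated_r(r_pq_obs, r_pp, r_qq):
    """Spearman's correction for attenuation: the estimated correlation
    between true scores is the observed correlation divided by the
    geometric mean of the two reliabilities."""
    return r_pq_obs / math.sqrt(r_pp * r_qq)

# Invented values: observed correlation .48, reliabilities .80 and .72
print(round(disattenuated_r(0.48, 0.80, 0.72), 3))  # 0.632
```

Note that the corrected correlation (.63) is substantially larger than the observed one (.48), which is the point of the correction: unreliability in either measure attenuates the observed correlation.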

Rephrasing Spearman (1904, 1910) in more current terminology (Lord & Novick, 1968; McDonald, 1999), reliability is the correlation between two parallel tests where tests are said to be parallel if for every subject, the true scores on each test are the expected scores across an infinite number of tests, and thus the same, and the error variances across subjects for each test are the same. Unfortunately, “all measurement is befuddled by error” (McNemar, 1946, p. 294).

Error may be defined as observed score − true score, and hence is uncorrelated with true score and uncorrelated across tests. Thus, reliability is the fraction of test variance that is true score variance. Such a definition, however, requires finding a parallel test. Knowing only the correlation between two tests, without knowing the true scores or their variance (and if we knew those, we would not bother with reliability), we are faced with three knowns (two variances and one covariance) but ten unknowns (the four variances and six covariances among the two true scores and two error scores).
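The decomposition implicit here (test variance = true score variance + error variance, with reliability the ratio of the first to the total) can be illustrated with invented numbers; a minimal sketch in which the item true-score covariances and error variances are all hypothetical:

```python
def decompose(Ct, Ve):
    """Given an item true-score covariance matrix Ct and item error
    variances Ve (errors uncorrelated with true scores and each other),
    return (Vt, Ve_total, Vx) for the composite test, where Vx = Vt + Ve."""
    Vt = sum(sum(row) for row in Ct)    # sum of all true-score covariances
    Ve_total = sum(Ve)                  # sum of the error variances
    return Vt, Ve_total, Vt + Ve_total

# Invented 3-item test: every true-score covariance is .4;
# error variances are .6, .5, and .7
Ct = [[0.4, 0.4, 0.4],
      [0.4, 0.4, 0.4],
      [0.4, 0.4, 0.4]]
Ve = [0.6, 0.5, 0.7]
Vt, Ve_total, Vx = decompose(Ct, Ve)
print(round(Vt, 1), round(Vx, 1))  # 3.6 5.4
print(round(Vt / Vx, 2))           # reliability = Vt/Vx = 0.67
```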

In this case of two tests, by defining them to be parallel with uncorrelated errors, the number of unknowns drops to three, and the reliability of each test may be found. With three tests, the number of assumptions may be reduced: if the tests are tau (τ) equivalent (each test has the same covariance with the true score), the reliability of each of the three tests may be found. With four tests, to find the reliability of each test, we need only assume that the tests all measure the same construct (that they are “congeneric”), although possibly with different true score and error score variances (Lord & Novick, 1968).

Unfortunately, with rare exceptions, we normally are faced with just one test, not two, three, or four. How, then, to estimate the reliability of that one test? The original solution was to estimate reliability from the correlation between two halves (r1), correcting for the fact that they were half tests rather than full tests by using a special case (n = 2) of the more general Spearman–Brown correction (Brown, 1910; Spearman, 1910):

rxx′ = 2r1 / (1 + r1).    (2)
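The Spearman–Brown correction, in its general form n·r1/(1 + (n − 1)·r1) with the n = 2 special case for split halves, can be sketched as follows (the split-half correlation of .70 is invented for illustration):

```python
def spearman_brown(r, n=2):
    """Spearman–Brown correction: reliability of a test n times as long
    as the test yielding reliability r. With n = 2 it steps a half-test
    correlation r1 up to the full-length test: 2*r1 / (1 + r1)."""
    return n * r / (1 + (n - 1) * r)

# Invented split-half correlation of .70
print(round(spearman_brown(0.70), 2))  # 0.82
```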

Subsequent efforts were based on the domain sampling model, in which tests are seen as being made up of items randomly sampled from a domain of items (Lord, 1955, made the distinction between “Type 1” sampling of people, “Type 2” sampling of items, and “Type 12” sampling of persons and items). The desire for an easy-to-use “magic bullet” based upon the domain sampling model has led to a number of solutions (e.g., the six considered by Guttman, 1945), of which one, coefficient alpha (Cronbach, 1951), is easy to compute and easy to understand. The appeal of α was perhaps that it was the average of all such random splits (Cronbach, 1951).
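Coefficient α is commonly computed from the item covariance matrix as α = (k/(k − 1))(1 − tr(C)/Vx), where Vx is the total test variance. A minimal sketch with an invented three-item covariance matrix:

```python
def cronbach_alpha(C):
    """Coefficient alpha from a k x k item covariance matrix C:
    alpha = (k / (k - 1)) * (1 - trace(C) / Vx),
    where Vx is the total test variance (the sum of all entries of C)."""
    k = len(C)
    v_total = sum(sum(row) for row in C)      # total test variance
    trace = sum(C[i][i] for i in range(k))    # sum of item variances
    return (k / (k - 1)) * (1 - trace / v_total)

# Invented 3-item covariance matrix: unit variances, covariances all .5
C = [[1.0, 0.5, 0.5],
     [0.5, 1.0, 0.5],
     [0.5, 0.5, 1.0]]
print(cronbach_alpha(C))  # 0.75
```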

Even though the pages of Psychometrika have been filled over the years with critiques and cautions about coefficient α and have seen elegant solutions for more appropriate estimates, few of these suggested coefficients are used. This is partly because they are not easily available in programs for the end user, nor described in a language that is accessible to many psychologists. In a statement reminiscent of Spearman’s observation that “Psychologists, with scarcely an exception, never seem to have become acquainted with the brilliant work being carried on since 1886 by the Galton–Pearson school” (Spearman, 1904, p. 96), Sijtsma (2008) points out that psychometrics and psychology have drifted apart as psychometrics has become more statistical and psychologists have remained psychologists. Without clear discussions of the alternatives and easily available programs to find the alternative estimates of reliability, most psychologists will continue to use α. With the advent of open source programming environments for statistics such as R (R Development Core Team, 2008), which are easy to access and straightforward to use, it is possible that the other estimates of reliability will become more commonly used.

What coefficients should we use? Sijtsma (2008) reviews a hierarchy of lower bound estimates of reliability and in agreement with Jackson and Agunwamba (1977) and Woodhouse and Jackson (1977) suggests that the glb or “greatest lower bound” (Bentler & Woodward, 1980) is, in fact, the best estimate. We believe that this is an inappropriate suggestion for at least three reasons:

  1. Contrary to what the name implies, the glb is not the greatest lower bound estimate of reliability, but is somewhat less than another easily calculated and understood estimate of reliability, ωtotal (ωt), of McDonald (1999). (We use the subscript on ωt to distinguish it from the coefficient ω introduced by McDonald (1978, equation (9)) and McDonald (1999, equation (6.20)), which he also called ω and which we (Zinbarg, Revelle, & Yovel, 2005) previously relabeled ωhierarchical (ωh).)
  2. Rather than focusing only on the greatest lower bounds as estimates of the reliability of a test, we should also be concerned with the percentage of the test that measures one construct. As has been discussed previously (Revelle, 1979; McDonald, 1999; Zinbarg et al., 2005), this may be estimated by finding ωh, the general factor saturation of the test (McDonald, 1999; Zinbarg et al., 2005), or the worst split half reliability of a test (coefficient beta, β; Revelle, 1979).
  3. Although it is easy to estimate all of the Guttman (1945) lower bounds, as well as β, ωh, and ωt, the techniques for estimating the glb are not readily available for the end user.
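The distinction in point 2 between the worst and the best split half can be made concrete by brute force over all splits of a small test. A simplified sketch, assuming an even number of items split into equal halves and an invented covariance matrix (exhaustive enumeration is feasible only for short tests):

```python
from itertools import combinations

def split_half_range(C):
    """Brute-force all equal splits of an even number of items and
    return (worst, best) split-half reliability, each computed as
    4 * Cab / Vx (Guttman's lambda-4 form). The minimum approximates
    coefficient beta (Revelle, 1979); the maximum is lambda-4."""
    k = len(C)
    total = sum(sum(row) for row in C)            # total test variance Vx
    rels = []
    for A in combinations(range(k), k // 2):
        if 0 not in A:                            # count each split once
            continue
        B = [j for j in range(k) if j not in A]
        cab = sum(C[i][j] for i in A for j in B)  # cross-half covariance
        rels.append(4 * cab / total)
    return min(rels), max(rels)

# Invented 4-item test with two clusters (items 0,1 and 2,3)
C = [[1.0, 0.6, 0.2, 0.2],
     [0.6, 1.0, 0.2, 0.2],
     [0.2, 0.2, 1.0, 0.6],
     [0.2, 0.2, 0.6, 1.0]]
worst, best = split_half_range(C)
print(round(worst, 2), round(best, 2))  # 0.4 0.8
```

The worst split (.40) puts each cluster in a different half; the best split (.80) mixes the clusters. The gap between them is exactly what a single summary coefficient such as α conceals.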

1. The Ordering of Reliability Estimates

Defined as the correlation between a test and a test just like it, reliability would seem to require a second test. The traditional solution when faced with just one test is to consider the internal structure of that test. Letting reliability be the ratio of true score variance to test score variance, or, alternatively, 1 − the ratio of error variance to test score variance, the problem becomes one of estimating the amount of error variance in the test. That is, two tests, X and a test just like it, X′, with covariance CXX′, may be represented by the joint covariance matrix

    | Vx    CXX′ |
    | CXX′  Vx′  |    (3)

and, letting Vx = 1Vx1′ and CXX′ = 1CXX′1′, the correlation between the two tests will be

    ρXX′ = CXX′ / √(Vx Vx′),    (4)

which, for parallel tests with equal variances, reduces to CXX′/Vx.

Although he argued that reliability was meaningful only in the case of test–retest correlations, Guttman (1945) may be credited with introducing a series of lower bounds for reliability, each based upon the item characteristics of a single test. These six have formed the basis for most subsequent estimates.

All of these estimates assume that the covariances between items represent true covariance, but that the variances of the items reflect an unknown sum of true and unique variance. That is, the variance of a test is simply the sum of the true covariances and the error variances:

Vx = 1Vx1′ = 1Ct1′ + 1Ve1′ = Vt + Ve

and the structure of the two tests seen in (3) becomes