# A comparison of reliability coefficients for psychometric tests that consist of two parts


## Abstract

If a test consists of two parts, the Spearman–Brown formula and Flanagan’s coefficient (Cronbach’s alpha) are standard tools for estimating its reliability. However, these coefficients may be inappropriate if their associated measurement models fail to hold. We study how robust reliability estimation in the two-part case is to coefficient misspecification. We compare five reliability coefficients under various conditions on the standard deviations and lengths of the parts, and derive conditional upper bounds of the differences between the coefficients. It is shown that the difference between the Spearman–Brown formula and Horst’s formula is negligible in many cases. We conclude that all five reliability coefficients can be used if there are only small or moderate differences between the standard deviations and the lengths of the parts.

## Keywords

Spearman–Brown formula · Cronbach’s alpha · Flanagan’s coefficient · Angoff–Feldt coefficient · Raju’s beta · Horst’s formula

## Mathematics Subject Classification

62H20 · 62P15 · 91C99

## 1 Introduction

In psychometrics, researchers are concerned with measuring the knowledge, abilities and attitudes of individuals. To measure these types of constructs investigators use measurement instruments like tests, exams and questionnaires. In this paper we will refer to any measurement instrument as a test. In test theory, an important property of a test is its reliability, which indicates how precisely a participant’s score is measured. In general, a test is said to be reliable if it produces similar scores for participants under consistent conditions. In reliability estimation a researcher wants to reflect the impact of as many sources of measurement error as possible (Feldt and Brennan 1989). For example, to reflect the day-to-day variation in efficiency of human minds it is an acknowledged principle that a researcher uses at least two interchangeable test forms (parallel-forms approach) or administers the same test twice (test–retest approach). However, because multiple testing is often considered too demanding for the participants, too time-consuming, or too costly, investigators usually administer the test only once. If there is only one test administration, researchers may resort, in the context of classical test theory (Lord and Novick 1968), to internal consistency coefficients for estimating the reliability of the test. The most commonly used internal consistency coefficients are Cronbach’s alpha and the Spearman–Brown formula (Cortina 1993; Osburn 2000; Hogan et al. 2000; Feldt and Charter 2003; Grayson 2004; Warrens 2014, 2015).

Internal consistency coefficients estimate reliability by dividing the total test into parts. A test may already consist of multiple parts, for example, a multiple choice part and an essay part. If the test consists of a set of items, the parts can be the individual items or subsets of the items. All reliability coefficients are based on the assumption that the different parts are homogeneous in content (Feldt and Brennan 1989). However, the coefficients are based on different conceptions of how the parts are related. For the coefficients in this paper there are three relevant measurement models, namely, classical parallel, essential tau-equivalence, and congeneric. The models are further discussed in the next section.

In this paper we compare five internal consistency coefficients that can be used if the test is divided into two parts. The coefficients are the Spearman–Brown formula (Spearman 1910; Brown 1910), Flanagan’s coefficient (Rulon 1939), Horst’s formula (Horst 1951), the Angoff–Feldt coefficient (Angoff 1953; Feldt 1975), and Raju’s beta (Raju 1977). The well-known coefficient Cronbach’s alpha reduces to Flanagan’s coefficient if we have only two parts. There are several reasons why a test may be divided into only two parts. Sometimes the requirement of content equivalence between parts limits the number of parts to two. Furthermore, in performance and educational settings tests frequently consist of a multiple choice part and an essay part. If previous research has shown that the two parts tend to measure the same construct of interest, it makes sense to leave the two parts intact in reliability estimation. Finally, with two parts we have the simplest reliability formulas and only a few statistics need to be calculated, which may also be a consideration.

The Spearman–Brown formula is based on the classical model, whereas Flanagan’s coefficient is based on the essential tau-equivalence approach. The other three coefficients can be used if the more general congeneric model holds. If a test consists of two parts the Spearman–Brown formula and Flanagan’s coefficient (Cronbach’s alpha) are commonly used, even when their associated measurement models may fail to hold. It appears that researchers are not aware that these coefficients are not universally applicable (Feldt and Charter 2003). Since it is likely that researchers will continue to use the Spearman–Brown formula and Flanagan’s coefficient in the near future, it seems useful to study how robust reliability estimation in the two-part case is to coefficient misspecification. We will do this by determining conditions under which the five coefficients produce (very) similar values, that is, conditions under which the coefficients can be used interchangeably.

Using simulated data Osburn (2000) and Feldt and Charter (2003) showed that the Spearman–Brown formula, Flanagan’s coefficient, the Angoff–Feldt coefficient and Raju’s beta produce similar values in a variety of situations. In this paper we compare the five coefficients analytically and derive upper bounds of the differences between the coefficients. The upper bounds hold under certain conditions on the standard deviations and lengths of the parts. In the process we formalize several rules of thumb presented in Feldt and Charter (2003). The paper is organized as follows. In the next section we introduce notation, discuss four measurement models, and define the five reliability coefficients. Unconditional and conditional inequalities between the coefficients are presented in Sect. 3. In Sect. 4 we study the pairwise differences between several coefficients and derive upper bounds of the differences and associated conditions. Section 5 contains a discussion.

## 2 Notation and definitions

Assumptions of four measurement models from classical test theory

Measurement model | True scores | Error terms |
---|---|---|
Parallel | \(T_1=T_2\), \(\text{var}(T_1)=\text{var}(T_2)\) | \(\text{var}(E_1)=\text{var}(E_2)\) |
Tau-equivalence | \(T_1=T_2\), \(\text{var}(T_1)=\text{var}(T_2)\) | \(\text{var}(E_1)\ne \text{var}(E_2)\) |
Essential tau-equivalence | \(T_1=T_2+b_1\), \(\text{var}(T_1)=\text{var}(T_2)\) | \(\text{var}(E_1)\ne \text{var}(E_2)\) |
Congeneric | \(T_1=b_2T_2+b_3\), \(b^2_2\sigma ^2_T\ne b^2_3\sigma ^2_T\) | \(\text{var}(E_1)\ne \text{var}(E_2)\) |

In the parallel measurement approach it is assumed that each participant has the same true score for both parts, and that the parts have equal error variances. This parallel model is the most restrictive model. If the observed variances of the parts differ greatly, the classical parallel model fails to hold. Several alternative models relax the notion of parallelism. In the tau-equivalent approach the parts may have different error variances. Moreover, if the parts are essentially tau-equivalent, the true scores of a participant on the two parts may also differ by an additive constant. This constant is the same for every participant. If the parts have substantially different lengths it is unrealistic to assume essential tau-equivalence. Finally, the congeneric model is the most general model. In this approach the true score variances of the parts are also allowed to be different.
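The five coefficients compared in this paper can all be computed from a handful of summary statistics. The sketch below is illustrative only and assumes the forms in which these coefficients are commonly stated in the literature: \(r\) is the correlation between the two part scores, \(\sigma_1\) and \(\sigma_2\) are the standard deviations of the parts, and \(p_1\) and \(p_2=1-p_1\) are the relative lengths of the parts. The function name `two_part_coefficients` and its argument names are hypothetical.

```python
import math


def two_part_coefficients(r, s1, s2, p1):
    """Return five two-part reliability estimates as a dict.

    Assumed inputs (hypothetical names):
      r  : correlation between the two part scores, 0 < r < 1
      s1 : standard deviation of part 1
      s2 : standard deviation of part 2
      p1 : relative length of part 1 (part 2 has p2 = 1 - p1)
    """
    p2 = 1.0 - p1
    cov = r * s1 * s2                    # covariance between the parts
    var_x = s1**2 + s2**2 + 2.0 * cov    # variance of the total score

    sb = 2.0 * r / (1.0 + r)             # Spearman-Brown formula
    alpha = 4.0 * cov / var_x            # Flanagan (two-part Cronbach's alpha)
    af = 4.0 * cov / (var_x - (s1**2 - s2**2) ** 2 / var_x)  # Angoff-Feldt
    beta = cov / (p1 * p2 * var_x)       # Raju's beta
    k = 4.0 * p1 * p2
    # Horst's formula; the expression is 0/0 at r = 1, hence r < 1 above.
    horst = (2.0 * r * (math.sqrt(r**2 + k * (1.0 - r**2)) - r)
             / (k * (1.0 - r**2)))
    return {"SB": sb, "alpha": alpha, "AF": af, "beta": beta, "H": horst}


# Illustrative input values (not taken from any data set).
for name, value in two_part_coefficients(0.6, 5.0, 4.0, 0.6).items():
    print(f"{name:>5}: {value:.4f}")
```

With equal standard deviations and equal lengths (\(\sigma_1=\sigma_2\), \(p_1=p_2=\tfrac{1}{2}\)) all five expressions in this sketch reduce to \(2r/(1+r)\).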

## 3 Inequalities

In this section we present several inequalities between the five reliability coefficients. It turns out that Flanagan’s coefficient is a lower bound of the other four coefficients. Some of the inequalities below have already been demonstrated by other authors. However, in this paper we are also interested in the conditions that specify when the inequalities are equalities. These conditions are often not specified in the literature. The formulations of the inequalities in this section therefore give a more complete picture of how the reliability coefficients are related.
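The ordering described in this section can be spot-checked numerically. The sketch below is not a proof; it assumes the commonly stated forms of the five coefficients, fixes \(\sigma_2=1\) so that \(c=\sigma_1/\sigma_2\) is the ratio of the standard deviations, and counts grid points at which any of the inequalities \(\alpha \le SB\le AF\), \(\alpha \le \beta \) or \(SB\le H\) fails. The function names, the grid and the tolerance are hypothetical choices.

```python
import math


def coefficients(r, c, p1):
    """Five two-part coefficients for sd ratio c = s1/s2 (s2 fixed at 1)."""
    p2 = 1.0 - p1
    cov = r * c
    var_x = c**2 + 1.0 + 2.0 * cov
    sb = 2.0 * r / (1.0 + r)
    alpha = 4.0 * cov / var_x
    af = 4.0 * cov / (var_x - (c**2 - 1.0) ** 2 / var_x)
    beta = cov / (p1 * p2 * var_x)
    k = 4.0 * p1 * p2
    h = (2.0 * r * (math.sqrt(r**2 + k * (1.0 - r**2)) - r)
         / (k * (1.0 - r**2)))
    return sb, alpha, af, beta, h


def count_violations(tol=1e-12):
    """Count grid points where one of the claimed inequalities fails."""
    violations = 0
    for i in range(1, 20):              # r = 0.05, 0.10, ..., 0.95
        r = i / 20
        for j in range(0, 11):          # c = 1.0, 1.1, ..., 2.0
            c = 1.0 + j / 10
            for m in range(2, 9):       # p1 = 0.2, 0.3, ..., 0.8
                p1 = m / 10
                sb, alpha, af, beta, h = coefficients(r, c, p1)
                ok = (alpha <= sb + tol and sb <= af + tol
                      and alpha <= beta + tol and sb <= h + tol)
                if not ok:
                    violations += 1
    return violations


print("violations:", count_violations())
```

The small tolerance only absorbs floating-point rounding at the points where two coefficients coincide, for example \(c=1\) or \(p_1=\tfrac{1}{2}\).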

Raju (1977) proved the inequality \(\alpha \le \beta \).

### **Lemma 1**

\(\alpha \le \beta \) with equality if and only if \(p_1=p_2=\tfrac{1}{2}\).

### *Proof*

We have \(\alpha \le \beta \) \(\Leftrightarrow \) \(4p_1p_2\le 1\). \(\square \)
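The equivalence used in the proof can be unpacked in one line. The following sketch assumes the usual forms of the two coefficients in terms of the covariance \(\sigma_{12}\) between the parts and the variance \(\sigma_X^2\) of the total score (both symbols are assumptions of this sketch), a positive covariance, and \(p_1+p_2=1\):

```latex
\alpha = \frac{4\sigma_{12}}{\sigma_X^2},
\qquad
\beta = \frac{\sigma_{12}}{p_1 p_2\,\sigma_X^2}
\quad\Longrightarrow\quad
\alpha = 4p_1p_2\,\beta ,
\qquad
1-4p_1p_2 = (p_1+p_2)^2-4p_1p_2 = (p_1-p_2)^2 \ge 0 .
```

Under these assumptions \(\alpha \le \beta \) holds exactly when \(4p_1p_2\le 1\), which is always the case, and equality requires \(p_1-p_2=0\), that is, \(p_1=p_2=\tfrac{1}{2}\).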

The double inequality \(\alpha \le SB\le AF\) is demonstrated in, for example, Feldt and Charter (2003).

### **Lemma 2**

\(\alpha \le SB\) with equality if and only if \(\sigma _1=\sigma _2\).

### *Proof*

### **Lemma 3**

\(SB\le AF\) with equality if \(r=1\) or if \(\sigma _1=\sigma _2\).

### *Proof*

Horst (1951) showed that \(SB\) is a special case of \(H\) for \(p_1=p_2=\frac{1}{2}\). Theorem 4 shows that the inequality \(SB\le H\) holds.

### **Theorem 4**

\(SB\le H\) with equality if and only if \(p_1=p_2=\tfrac{1}{2}\).

### *Proof*

## 4 Upper bounds of the differences

In this section we study differences between the five reliability coefficients. In the previous section we presented several inequalities between the coefficients. These results can be used to interpret positive differences between some of the coefficients. For each difference we present several upper bounds and associated conditions.

It follows from Lemma 1 that the difference \(\beta -\alpha \) is non-negative. We have the following upper bounds for the difference \(\beta -\alpha \).

### **Lemma 5**

### *Proof*

Lemma 5 shows that if there are only small differences between the lengths of the parts, Raju’s beta and Flanagan’s coefficient produce very similar values. If the longer part is one and a half times as long as the shorter part (\(|p_1-p_2|=0.20\)), the difference is at most 0.04. In many applications a difference of this size is of no practical significance.
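The arithmetic behind a figure such as 0.04 can be sketched as follows. Assuming the usual forms of the two coefficients, \(\alpha = 4p_1p_2\,\beta \), and since \(p_1+p_2=1\) we have \(1-4p_1p_2=(p_1-p_2)^2\). Writing \(d=|p_1-p_2|\), this gives \(\beta -\alpha =\beta d^2\), which is at most \(d^2\) whenever \(\beta \le 1\). The helper names below are hypothetical, and the restriction \(\beta \le 1\) is an assumption of this sketch, not a restatement of Lemma 5.

```python
def beta_from_alpha(alpha, p1):
    """Raju's beta implied by Flanagan's alpha, assuming alpha = 4*p1*p2*beta."""
    p2 = 1.0 - p1
    return alpha / (4.0 * p1 * p2)


def gap_bound(d):
    """Bound d**2 on beta - alpha, valid under the assumption beta <= 1."""
    return d ** 2


alpha = 0.80  # illustrative value of Flanagan's coefficient
for d in (0.10, 0.20, 0.30):
    p1 = 0.5 + d / 2.0          # so that |p1 - p2| = d
    gap = beta_from_alpha(alpha, p1) - alpha
    print(f"d = {d:.2f}   beta - alpha = {gap:.4f}   bound d^2 = {gap_bound(d):.4f}")
```

For \(d=0.20\), which corresponds to \(p_1=0.6\) and \(p_2=0.4\), the bound \(d^2\) equals 0.04.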

### **Theorem 6**

### *Proof*

It follows from Lemmas 2 and 3 that the difference \(AF-\alpha \) is non-negative. We have the following upper bounds for the difference \(AF-\alpha \).

### **Theorem 7**

### *Proof*

Theorem 7 shows that for small differences between the standard deviations Flanagan’s coefficient and the Angoff–Feldt coefficient produce very similar values. If the larger standard deviation is no more than 15 % larger than the smaller (\(c\le 1.15\)) then the difference is always less than 0.02. Since the value of the Spearman–Brown formula is between the values of these two coefficients (Lemmas 2 and 3), we may conclude that for \(c\le 1.15\) the difference between the three coefficients is always less than 0.02. In many cases a difference of this size is negligible.
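The size of \(AF-\alpha \) for a given ratio of standard deviations can be explored numerically. The sketch below assumes the usual forms of Flanagan’s coefficient and the Angoff–Feldt coefficient and takes \(c=\sigma_1/\sigma_2\ge 1\) to be the ratio of the larger to the smaller standard deviation; under these forms both coefficients depend on the standard deviations only through \(c\), so \(\sigma_2\) is fixed at 1. The function names and the grid resolution are hypothetical choices.

```python
def af_minus_alpha(r, c):
    """AF - alpha for part correlation r and sd ratio c (s2 fixed at 1)."""
    cov = r * c
    var_x = c**2 + 1.0 + 2.0 * cov
    alpha = 4.0 * cov / var_x
    af = 4.0 * cov / (var_x - (c**2 - 1.0) ** 2 / var_x)
    return af - alpha


def max_gap(c_max, steps=200):
    """Largest AF - alpha over a grid with 0 < r < 1 and 1 <= c <= c_max."""
    best = 0.0
    for i in range(1, steps):
        r = i / steps
        for j in range(steps + 1):
            c = 1.0 + (c_max - 1.0) * j / steps
            best = max(best, af_minus_alpha(r, c))
    return best


for c_max in (1.15, 1.30, 1.50):
    print(f"c <= {c_max:.2f}: max AF - alpha on grid = {max_gap(c_max):.4f}")
```

A grid search of this kind only gives a lower estimate of the true maximum, so it can illustrate an analytical bound but not replace it.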

It follows from Theorem 4 that the difference \(H-SB\) is non-negative. We have the following upper bounds for the difference \(H-SB\).

### **Theorem 8**

### *Proof*

Theorem 8 shows that even with substantial differences between the lengths of the parts the Spearman–Brown formula and Horst’s formula produce very similar values. When the longer part is one and a half times as long as the shorter part (\(|p_1-p_2|=0.20\)), the difference is less than 0.01, a size that is negligible. Even when the longer part is three times as long as the shorter part (\(|p_1-p_2|=0.50\)), the difference is less than 0.05.
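The two figures quoted above can be illustrated with a short grid search over the correlation \(r\). The sketch assumes the commonly stated form of Horst’s formula and uses the identity \(4p_1p_2=1-d^2\) with \(d=|p_1-p_2|\), which follows from \(p_1+p_2=1\). The function names and the grid are hypothetical choices.

```python
import math


def horst_minus_sb(r, d):
    """H - SB for part correlation r (0 < r < 1) and length difference d."""
    k = 1.0 - d**2                      # equals 4 * p1 * p2
    horst = (2.0 * r * (math.sqrt(r**2 + k * (1.0 - r**2)) - r)
             / (k * (1.0 - r**2)))
    sb = 2.0 * r / (1.0 + r)
    return horst - sb


def max_gap(d, steps=1000):
    """Largest H - SB over r = 1/steps, 2/steps, ..., (steps - 1)/steps."""
    return max(horst_minus_sb(i / steps, d) for i in range(1, steps))


for d in (0.20, 0.50):
    print(f"|p1 - p2| = {d:.2f}: max H - SB on grid = {max_gap(d):.4f}")
```

For \(d=0\) the two expressions coincide, in line with the remark that the Spearman–Brown formula is the special case of Horst’s formula with \(p_1=p_2=\tfrac{1}{2}\).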

## 5 Discussion

In this paper we compared five reliability coefficients for tests that consist of two parts. The coefficients are the Spearman–Brown formula, Flanagan’s coefficient (a special case of Cronbach’s alpha), Horst’s formula, the Angoff–Feldt coefficient, and Raju’s beta. We first presented inequalities between the reliability coefficients. The inequalities were then used to formulate positive differences between the coefficients. Using analytical techniques we then derived several upper bounds of the differences between the coefficients. The upper bounds hold under certain conditions.

Criteria for qualifying the values of the differences between the coefficients depend on the context of the reliability estimation. In this paper we use the following criteria. A difference of at most 0.04 is considered to be of no practical significance. Furthermore, a value that is smaller than or equal to 0.02 is considered to be negligible. A difference between coefficients is substantial if its size is bigger than or equal to 0.10. These criteria are of course arbitrary. The reader may interpret the results using other critical values.

An interesting relationship was found between the values of the Spearman–Brown formula and Horst’s formula. Theorem 8 shows that even with substantial differences between the lengths of the parts the formulas produce very similar values. When the longer part is one and a half times as long as the shorter part, the difference is less than 0.01. A difference of this size is negligible. But even if the longer part is three times as long as the shorter part, the difference between the coefficients is always less than 0.05. The Spearman–Brown formula is based on the classical parallel model, and this model fails to hold with substantial differences between the lengths of the parts. Horst (1951) proposed his formula for the case that the parts have different lengths. However, in many real-life situations the difference will be negligible, although Horst’s formula will always produce a (slightly) higher value.

Lemmas 2 and 3 together with Theorem 7 show that for small differences between the standard deviations the Spearman–Brown formula, Flanagan’s coefficient and the Angoff–Feldt coefficient produce very similar values. If the larger standard deviation is no more than 15 % larger than the smaller then the difference is always less than 0.02. In this case all three coefficients can be used, which confirms and formalizes a rule of thumb presented in Feldt and Charter (2003). Theorems 6, 7 and 8 together with Lemma 5 show that for small and moderate differences between the lengths and standard deviations of the parts all five coefficients produce very similar values. If the larger standard deviation is no more than 15 % larger than the smaller, and if the difference between the lengths (in proportions) is at most 0.10, then the differences between the values are less than 0.02. In this case all five coefficients can be used. These results partly explain why the coefficients produce very similar values for the simulated data in Osburn (2000).

If the larger standard deviation is no more than 30 % larger than the smaller, and if the difference between the lengths is at most 0.20, then the differences between the values are less than 0.07. If we exclude the Angoff–Feldt coefficient, the differences between the values of the other four coefficients are less than 0.041. If, in addition, the correlation between the parts is at least 0.50, then the differences between all five coefficients are less than 0.04. In this case application of any coefficient will probably lead to the same conclusion. Finally, even if the differences between the standard deviations and lengths are relatively large, the maximum difference between the coefficients is less than 0.10. More precisely, if the larger standard deviation is no more than 50 % larger than the smaller, if the difference between the lengths is at most 0.30, and if the correlation between the parts is at least 0.50, then the differences between the values of the coefficients are at most 0.09.
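Combined conditions of this kind can be explored with a small grid search over the spread of the five coefficients, that is, the largest value minus the smallest. The sketch below assumes the commonly stated forms of the coefficients and treats the correlation \(r\), the ratio of standard deviations \(c\) and the length difference \(d=|p_1-p_2|\) as freely varying inputs on a grid with \(0.05\le r\le 0.95\); these modelling choices, like the function names, are assumptions of the sketch and not part of the analysis in this paper.

```python
import math


def spread(r, c, p1):
    """Largest minus smallest of the five coefficients (s2 fixed at 1)."""
    p2 = 1.0 - p1
    cov = r * c
    var_x = c**2 + 1.0 + 2.0 * cov
    sb = 2.0 * r / (1.0 + r)
    alpha = 4.0 * cov / var_x
    af = 4.0 * cov / (var_x - (c**2 - 1.0) ** 2 / var_x)
    beta = cov / (p1 * p2 * var_x)
    k = 4.0 * p1 * p2
    h = (2.0 * r * (math.sqrt(r**2 + k * (1.0 - r**2)) - r)
         / (k * (1.0 - r**2)))
    values = (sb, alpha, af, beta, h)
    return max(values) - min(values)


def max_spread(c_max, d_max, steps=20):
    """Largest spread on a grid with 1 <= c <= c_max and 0 <= d <= d_max."""
    best = 0.0
    for i in range(steps + 1):
        r = 0.05 + 0.90 * i / steps          # r from 0.05 to 0.95
        for j in range(steps + 1):
            c = 1.0 + (c_max - 1.0) * j / steps
            for m in range(steps + 1):
                d = d_max * m / steps
                p1 = 0.5 + d / 2.0           # so that |p1 - p2| = d
                best = max(best, spread(r, c, p1))
    return best


for c_max, d_max in ((1.15, 0.10), (1.30, 0.20)):
    print(f"c <= {c_max:.2f}, d <= {d_max:.2f}: "
          f"max spread on grid = {max_spread(c_max, d_max):.4f}")
```

The printed values can be set against the thresholds 0.02 and 0.07 quoted above. Because \(r\), \(c\) and \(d\) vary independently here, some grid points may correspond to configurations that no measurement model would produce.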

Since the five coefficients produce very similar values for small and moderate differences between the standard deviations and the lengths of the two parts, we conclude that reliability estimation in the two-part case tends to be robust to coefficient misspecification. If the differences between the standard deviations and the lengths are large, the values of the reliability coefficients diverge. In this case both the classical parallel model and the essentially tau-equivalent model fail to hold, and application of the Spearman–Brown formula and Flanagan’s coefficient (Cronbach’s alpha) is not appropriate. Which coefficient should be used in this case, Horst’s formula, the Angoff–Feldt coefficient, or Raju’s beta, appears to be an open problem, and is thus a topic for further investigation.

## References

- Angoff WH (1953) Test reliability and effective test length. Psychometrika 18:1–14
- Brown W (1910) Some experimental results in the correlation of mental abilities. Br J Psychol 3:296–322
- Cortina JM (1993) What is coefficient alpha? An examination of theory and applications. J Appl Psychol 78:98–104
- Cronbach LJ (1951) Coefficient alpha and the internal structure of tests. Psychometrika 16:297–334
- Feldt LS (1975) Estimation of reliability of a test divided into two parts of unequal length. Psychometrika 40:557–561
- Feldt LS, Brennan RL (1989) Reliability. In: Linn RL (ed) Educational measurement, 3rd edn. Macmillan, New York, pp 105–146
- Feldt LS, Charter RA (2003) Estimating the reliability of a test split into two parts of equal or unequal length. Psychol Methods 8:102–109
- Grayson D (2004) Some myths and legends in quantitative psychology. Underst Stat 3:101–134
- Guttman L (1945) A basis for analyzing test–retest reliability. Psychometrika 10:255–282
- Hogan TP, Benjamin A, Brezinski KL (2000) Reliability methods: a note on the frequency of use of various types. Educ Psychol Meas 60:523–561
- Horst P (1951) Estimating the total test reliability from parts of unequal length. Educ Psychol Meas 11:368–371
- IBM (2011) IBM SPSS Statistics 20 Algorithms Manual, p 1069
- Lord FM, Novick MR (1968) Statistical theories of mental test scores. Addison-Wesley, Reading
- Osburn HG (2000) Coefficient alpha and related internal consistency reliability coefficients. Psychol Methods 5:343–355
- Raju NS (1977) A generalization of coefficient alpha. Psychometrika 42:549–565
- Rulon PJ (1939) A simplified procedure for determining the reliability of a test by split-halves. Harv Educ Rev 9:99–103
- Spearman C (1910) Correlation calculated from faulty data. Br J Psychol 3:271–295
- Warrens MJ (2014) On Cronbach’s alpha as the mean of all possible k-split alphas. Adv Stat 2014
- Warrens MJ (2015) Some relationships between Cronbach’s alpha and the Spearman–Brown formula. J Classif (in press)

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.