Derivation of the free-response kappa
For two raters, the usual kappa statistic is (P_o - P_e)/(1 - P_e), where P_o is the proportion of observed concordant ratings and P_e is the expected proportion of concordant ratings due to chance alone. When the rating is dichotomous, data can be summarized in a 2 × 2 table. Let us denote by a the number of findings that are rated as negative by both raters, b and c the numbers of findings rated as positive by one rater but negative by the other, and d the number of findings rated as positive by both raters. There are therefore a + d concordant pairs of ratings and b + c discordant pairs among N pairs of observations. Assuming that observations are mutually independent, P_o is estimated by (a + d)/N and P_e by [(a + c)(a + b) + (c + d)(b + d)]/N². Then, the kappa statistic (in this case, Cohen’s kappa) is given by:
$$ K=\frac{2\left( ad- bc\right)}{\left( b+ c\right) N+2\left( ad- bc\right)} $$
(1)
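For concreteness, Eq. (1) can be computed directly from the four cell counts. A minimal Python sketch (the function name and example counts are ours):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa (Eq. 1) from a 2 x 2 table of two dichotomous raters.

    a: rated negative by both raters; b, c: rated positive by one rater
    but negative by the other; d: rated positive by both raters.
    """
    n = a + b + c + d
    return 2 * (a * d - b * c) / ((b + c) * n + 2 * (a * d - b * c))

# Hypothetical counts: here P_o = 0.9 and P_e = 0.505,
# so kappa = (0.9 - 0.505) / (1 - 0.505)
print(cohens_kappa(40, 5, 5, 50))  # ~0.798
```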
When patients can contribute more than one observation, data are clustered. Yang et al. [7] proposed a kappa statistic obtained from the usual formula (P_o - P_e)/(1 - P_e), where P_o is a weighted average of the proportions of agreement over clusters (patients) and P_e is obtained from weighted averages of the marginal proportions of ratings of each rater. With this approach, the kappa for clustered data has the same estimate as when clustering is ignored. Therefore the basic 2 × 2 table is also appropriate for the estimation of agreement for clustered data.
For free-response assessments, each rater reports only positive findings and the number a is unknown. It would be wrong to replace a by 0, as if the raters had not agreed on any negative observation; both the observed agreement and kappa would be underestimated. It would also be incorrect to simply replace a by the number of patients without any positive finding, because several potential lesion sites exist in each patient. Typically, a can be assumed to be high in imaging examinations, because each output displays a large number of anatomical or functional structures or substructures, each potentially positive or negative. Therefore, the number of positive findings in a given patient is usually small in comparison with the potential number of abnormalities that might occur.
We propose here a kappa statistic defined as the limit of Cohen’s kappa as a approaches infinity. The partial derivative of the kappa statistic defined in Eq. (1) with respect to a is:
$$ \frac{\partial \widehat{K}}{\partial a}=\frac{2\left( b+ c\right)\left( b+ d\right)\left( c+ d\right)}{{\left[\left( a+ b\right)\left( b+ d\right)+\left( a+ c\right)\left( c+ d\right)\right]}^2} $$
This partial derivative is positive; therefore the kappa statistic increases monotonically with a. Moreover, the kappa statistic is bounded above by 1, and an increasing, bounded function must converge, so the kappa statistic has a finite limit as a approaches infinity. We call this limit the free-response kappa (K_FR). Per Eq. (1), Cohen’s kappa is the ratio of two functions of a, f(a) = 2(ad - bc) and g(a) = (b + c)(a + b + c + d) + 2(ad - bc), both of which approach infinity as a approaches infinity, so that the limit of their ratio is indeterminate. By L’Hôpital’s rule, K_FR equals the limit of the ratio of the partial derivatives of f(a) and g(a) as a approaches infinity, which turns out to be
$$ {K}_{FR}=\frac{2 d}{b+ c+2 d} $$
(2)
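Numerically, Cohen’s kappa from Eq. (1) converges to Eq. (2) as a grows, which can be checked with a short sketch (function names and counts are ours):

```python
def cohens_kappa(a, b, c, d):
    # Eq. (1) for a complete 2 x 2 table
    n = a + b + c + d
    return 2 * (a * d - b * c) / ((b + c) * n + 2 * (a * d - b * c))

def free_response_kappa(b, c, d):
    # Eq. (2): the limit of Cohen's kappa as the unknown count a -> infinity
    return 2 * d / (b + c + 2 * d)

b, c, d = 5, 7, 20  # hypothetical positive-finding counts
for a in (10, 1_000, 100_000):
    print(a, cohens_kappa(a, b, c, d))   # increases toward the limit
print("limit:", free_response_kappa(b, c, d))  # 40/52 ~ 0.769
```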
Properties of free-response kappa
K_FR has several interesting properties. It does not depend on a, but only on the positive observations b, c, and d. Therefore the uncertainty about a does not preclude the estimation of agreement beyond chance if the number of negative findings can be considered very large.
When interpreting K_FR, it is helpful to consider the numbers of ratings made by each rater individually. The first rater made c + d positive observations, and the second rater made b + d positive observations. Therefore the denominator b + c + 2d is the total number of positive individual observations made by the 2 raters, 2d is the number of positive observations made by either rater that were confirmed by the other, and b + c is the number of positive observations made by either rater that were not confirmed by the other. K_FR is thus the proportion of confirmed positive individual observations among all positive individual observations. A K_FR statistic of 0.5 means that half of the positive findings were confirmed by the other rater, which may be considered average, whereas 0.8 might be considered very good. This is in line with published interpretation guidelines for Cohen’s kappa [8].
When the data are clustered, K_FR can be obtained directly by collapsing the 2 × 2 tables of all clusters into a single 2 × 2 table and applying Eq. (2). The pooled K_FR is a weighted average of the individual free-response kappa statistics of patients with at least one positive observation (each patient is indexed by k):
$$ {K}_{FR}={\displaystyle \sum_k}{v}_k\frac{2{d}_k}{b_k+{c}_k+2{d}_k} $$
where each weight v_k represents the proportion of positive ratings in patient k among all positive ratings:
$$ {v}_k=\frac{b_k+{c}_k+2{d}_k}{b+ c+2 d} $$
It follows that patients without any detected lesions do not contribute to the estimate of K_FR; their weight is zero. Therefore patient-level clustering does not need to be taken into account to compute K_FR, and patients without positive findings can be ignored.
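This identity is easy to verify numerically. A sketch with hypothetical per-patient counts (b_k, c_k, d_k):

```python
# Hypothetical per-patient counts (b_k, c_k, d_k); the last patient has
# no positive findings and therefore zero weight.
patients = [(1, 0, 3), (0, 2, 5), (1, 1, 0), (0, 0, 0)]

b = sum(bk for bk, ck, dk in patients)
c = sum(ck for bk, ck, dk in patients)
d = sum(dk for bk, ck, dk in patients)
pooled = 2 * d / (b + c + 2 * d)  # Eq. (2) on the collapsed table

# Weighted average of per-patient free-response kappas, over patients
# with at least one positive observation
weighted = sum(
    ((bk + ck + 2 * dk) / (b + c + 2 * d)) * (2 * dk / (bk + ck + 2 * dk))
    for bk, ck, dk in patients
    if bk + ck + dk > 0
)
print(pooled, weighted)  # both equal 16/21
```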
Of note, the equation for K_FR corresponds to the proportion of specific (positive) agreement as described by Fleiss [9]. While the equation is identical, the purpose and interpretation are different. For Fleiss, specific positive agreement (and also specific negative agreement) is a complementary statistic that enhances the interpretation of overall agreement. The omission of double-negative observations is an a priori decision. Importantly, Fleiss is interested in observed agreement, not in agreement corrected for chance. Finally, Fleiss does not address the free-response context.
Variance of the free-response kappa
Because K_FR is bounded by 0 and 1, we first normalized the estimator by taking the logit of K_FR, i.e. ln(K_FR/(1 - K_FR)). The variance of the estimated logit(K_FR), obtained by the delta method (Appendix 1), is:
$$ V a r\left( logit\left({K}_{FR}\right)\right)=\frac{\left( b+ c+ d\right)}{\left( b+ c\right) d} $$
(3)
Thus a confidence interval can be obtained for logit(K_FR), and the lower and upper confidence bounds back-transformed to the original scale.
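A sketch of this procedure (the function name is ours; it requires b + c > 0 and d > 0 so that the logit and Eq. (3) are defined):

```python
import math

def kfr_logit_ci(b, c, d, z=1.96):
    """Approximate 95% CI for K_FR via the logit transform and Eq. (3)."""
    kfr = 2 * d / (b + c + 2 * d)
    logit = math.log(kfr / (1 - kfr))            # equals log(2d / (b + c))
    se = math.sqrt((b + c + d) / ((b + c) * d))  # square root of Eq. (3)
    # back-transform both bounds to the K_FR scale
    expit = lambda x: math.exp(x) / (1 + math.exp(x))
    return expit(logit - z * se), expit(logit + z * se)

print(kfr_logit_ci(5, 7, 20))  # interval around K_FR = 40/52 ~ 0.769
```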
An alternative approach is to make use of the direct relationship between K_FR and the proportion of concordant pairs of observations among all available observations, p = d/(b + c + d). It is easily shown that K_FR = 2p/(1 + p). Therefore a 95% confidence interval can be obtained for p, using any available method for binomial proportions including exact methods, and the confidence bounds can then be back-transformed to the K_FR scale.
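For example, with the Agresti-Coull interval for p (a sketch; function names are ours), the bounds transform directly because K_FR = 2p/(1 + p) is increasing in p:

```python
import math

def agresti_coull_ci(x, n, z=1.96):
    # Agresti-Coull interval for a binomial proportion x/n
    n_adj = n + z ** 2
    p_adj = (x + z ** 2 / 2) / n_adj
    half = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half), min(1.0, p_adj + half)

def kfr_ci_from_p(b, c, d):
    # p = d / (b + c + d); back-transform the bounds to the K_FR scale
    lo_p, hi_p = agresti_coull_ci(d, b + c + d)
    return 2 * lo_p / (1 + lo_p), 2 * hi_p / (1 + hi_p)

print(kfr_ci_from_p(5, 7, 20))
```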
We have simulated the performance of three confidence interval methods for independent observations at K_FR values of 0.3, 0.5, 0.7, and 0.9, and for sample sizes (N = b + c + d) of 20, 50, 100, and 200. For each condition we generated 50,000 random samples from a binomial distribution with parameters N and p, where p was defined by K_FR/(2 - K_FR), which is the inverse of the equation K_FR = 2p/(1 + p). For each sample we computed a 95% confidence interval using Eq. (3) for the logit of K_FR, and also using 2 methods for the binomial parameter p that are appropriate for small samples in which asymptotic estimation methods may yield incorrect results: the Agresti-Coull method [10] and the Clopper-Pearson method [11]. For each situation we report the mean simulated value of K_FR, the proportion of confidence intervals that include the true value, and the mean width of the confidence intervals.
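The simulation of the Eq. (3) intervals can be sketched as follows (a scaled-down version with fewer replicates; variable and function names are ours):

```python
import math
import random

def simulate_coverage(kfr, n, n_sims=2000, z=1.96, seed=1):
    """Coverage of the delta-method CI (Eq. 3), with N = b + c + d = n."""
    rng = random.Random(seed)
    p = kfr / (2 - kfr)                   # inverse of K_FR = 2p / (1 + p)
    target = math.log(kfr / (1 - kfr))    # true logit(K_FR)
    covered = usable = 0
    for _ in range(n_sims):
        d = sum(rng.random() < p for _ in range(n))  # binomial draw
        bc = n - d            # b + c combined; Eq. (3) needs only their sum
        if d == 0 or bc == 0:
            continue          # degenerate sample, Eq. (3) undefined
        usable += 1
        logit = math.log(2 * d / bc)      # estimated logit(K_FR)
        se = math.sqrt(n / (bc * d))      # square root of Eq. (3)
        covered += (logit - z * se) <= target <= (logit + z * se)
    return covered / usable

print(simulate_coverage(0.7, 100))  # close to the nominal 0.95
```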
All three methods performed well (Table 1). Confidence intervals based on Eq. (3) had lower coverage (0.932) when the sample size and K_FR were both small. This is because in this case 2% of the samples were degenerate (d = 0 or d = N), and Eq. (3) could not be applied (had we excluded these samples, the coverage would have been 0.951). The Clopper-Pearson method produced the highest coverage, but at the expense of unnecessarily wide confidence intervals. Confidence intervals were narrower for Eq. (3) and for the Agresti-Coull method.
Table 1
Simulations of the coverage and mean width of 95% confidence intervals for the free-response kappa at selected sample sizes (20, 50, 100, 200) and values of kappa (0.3, 0.5, 0.7, 0.9), using three methods: delta method (Eq. 3), Agresti-Coull confidence limits, and Clopper-Pearson confidence limits

| N | K_FR | Mean K_FR | Degenerate samples (d = 0 or N) | Coverage: Eq. (3) | Coverage: Agresti-Coull | Coverage: Clopper-Pearson | Width: Eq. (3) | Width: Agresti-Coull | Width: Clopper-Pearson |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 20 | 0.3 | 0.291 | 0.020 | 0.932 | 0.952 | 0.966 | 0.446 | 0.444 | 0.473 |
| 20 | 0.5 | 0.491 | <0.001 | 0.944 | 0.944 | 0.969 | 0.426 | 0.419 | 0.471 |
| 20 | 0.7 | 0.693 | 0 | 0.957 | 0.957 | 0.976 | 0.354 | 0.345 | 0.392 |
| 20 | 0.9 | 0.897 | 0.019 | 0.964 | 0.981 | 0.964 | 0.224 | 0.218 | 0.235 |
| 50 | 0.3 | 0.297 | <0.001 | 0.962 | 0.962 | 0.962 | 0.293 | 0.294 | 0.314 |
| 50 | 0.5 | 0.497 | 0 | 0.949 | 0.949 | 0.965 | 0.284 | 0.281 | 0.305 |
| 50 | 0.7 | 0.697 | 0 | 0.953 | 0.936 | 0.968 | 0.230 | 0.227 | 0.246 |
| 50 | 0.9 | 0.899 | <0.001 | 0.958 | 0.958 | 0.974 | 0.134 | 0.134 | 0.142 |
| 100 | 0.3 | 0.298 | 0 | 0.954 | 0.954 | 0.954 | 0.211 | 0.212 | 0.223 |
| 100 | 0.5 | 0.498 | 0 | 0.945 | 0.945 | 0.968 | 0.204 | 0.203 | 0.215 |
| 100 | 0.7 | 0.698 | 0 | 0.946 | 0.946 | 0.966 | 0.164 | 0.163 | 0.172 |
| 100 | 0.9 | 0.899 | 0 | 0.948 | 0.948 | 0.963 | 0.093 | 0.093 | 0.098 |
| 200 | 0.3 | 0.299 | 0 | 0.947 | 0.947 | 0.959 | 0.151 | 0.151 | 0.157 |
| 200 | 0.5 | 0.499 | 0 | 0.948 | 0.948 | 0.957 | 0.146 | 0.145 | 0.151 |
| 200 | 0.7 | 0.699 | 0 | 0.952 | 0.952 | 0.952 | 0.116 | 0.116 | 0.120 |
| 200 | 0.9 | 0.900 | 0 | 0.957 | 0.957 | 0.957 | 0.065 | 0.065 | 0.068 |
Of note, the mean values of observed K_FR were slightly below the parameter values, especially at small sample sizes. This is because we simulated with a fixed parameter p, and K_FR = 2p/(1 + p) is a concave function. By Jensen’s inequality, the expectation of a concave function of p (i.e., the mean observed K_FR) is then less than the function of the expectation of p (i.e., the K_FR that corresponds to the parameter p).
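This small downward bias is easy to reproduce under the same simulation setup (a sketch; for N = 20 and K_FR = 0.3, Table 1 reports a mean of 0.291):

```python
import random
import statistics

random.seed(0)
n, kfr = 20, 0.3
p = kfr / (2 - kfr)  # binomial parameter corresponding to K_FR = 0.3
estimates = []
for _ in range(50_000):
    d = sum(random.random() < p for _ in range(n))
    # observed K_FR = 2*p_hat / (1 + p_hat) with p_hat = d/n
    estimates.append(2 * d / (n + d))
print(statistics.mean(estimates))  # slightly below 0.3 (Jensen's inequality)
```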
To be valid, these estimation methods require observations to be mutually independent. This may apply in some circumstances: e.g., if a paired screening test is applied to a large population, and only those with at least one positive result are referred for further investigation. But for most imaging procedures, data are naturally clustered within patients. Then the proposed asymptotic variance of K_FR would be biased. In the presence of clustering, a bootstrap procedure can be used to obtain a confidence interval (see Appendix 2).
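A generic patient-level (cluster) bootstrap can be sketched as follows: resample patients with replacement, recompute the pooled K_FR each time, and take percentile bounds. This illustrates the general idea only, not necessarily the exact procedure of Appendix 2; the data and names are hypothetical.

```python
import random

def cluster_bootstrap_ci(patients, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap CI for K_FR with patient-level resampling.

    patients: list of per-patient counts (b_k, c_k, d_k).
    """
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(patients) for _ in patients]
        b = sum(x[0] for x in sample)
        c = sum(x[1] for x in sample)
        d = sum(x[2] for x in sample)
        if b + c + 2 * d > 0:            # skip all-negative resamples
            stats.append(2 * d / (b + c + 2 * d))
    stats.sort()
    k = len(stats)
    return stats[int(alpha / 2 * k)], stats[min(k - 1, int((1 - alpha / 2) * k))]

patients = [(1, 0, 3), (0, 2, 5), (1, 1, 0), (2, 0, 4), (0, 0, 1)]
print(cluster_bootstrap_ci(patients))
```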