Background
Motivating example
Rater II | ||||
---|---|---|---|---|
Rater I | 0 | 1 | 2 | 3 |
0 | 36 | 0 | 0 | 0 |
1 | 7 | 57 | 11 | 0 |
2 | 0 | 23 | 34 | 4 |
3 | 0 | 1 | 19 | 10 |
Agreement coefficient | |||
---|---|---|---|
Weight | Cohen’s kappa | Gwet’s AC2 | Brennan-Prediger’s S |
Linear | 0.674 | 0.759 | 0.739 |
Quadratic | 0.799 | 0.884 | 0.865 |
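The coefficients above can be checked directly from the motivating agreement table. The following is a minimal sketch, assuming the standard two-rater weighted forms of Cohen's kappa, Gwet's AC2 and Brennan-Prediger's S, each written as \((p_o - p_e)/(1 - p_e)\) with a coefficient-specific chance term; the function and variable names are illustrative and not taken from the paper. Under these assumptions the results agree with the tabulated values to within 0.001.

```python
# Motivating agreement table (rows: Rater I, columns: Rater II).
TABLE_1 = [
    [36, 0, 0, 0],
    [7, 57, 11, 0],
    [0, 23, 34, 4],
    [0, 1, 19, 10],
]


def agreement_weights(q, kind):
    """Linear or quadratic agreement weights for q ordered categories."""
    power = 1 if kind == "linear" else 2
    return [[1 - (abs(i - j) / (q - 1)) ** power for j in range(q)]
            for i in range(q)]


def agreement_coefficients(table, kind="linear"):
    """Weighted Cohen's kappa, Gwet's AC2 and Brennan-Prediger's S."""
    q = len(table)
    n = sum(map(sum, table))
    w = agreement_weights(q, kind)
    cells = [(i, j) for i in range(q) for j in range(q)]

    # Marginal proportions of each rater.
    row = [sum(table[i]) / n for i in range(q)]
    col = [sum(table[i][j] for i in range(q)) / n for j in range(q)]

    # Weighted observed agreement, shared by all three coefficients.
    p_o = sum(w[i][j] * table[i][j] for i, j in cells) / n

    # Chance agreement terms (assumed standard definitions).
    t_w = sum(map(sum, w))
    pe_kappa = sum(w[i][j] * row[i] * col[j] for i, j in cells)
    pi = [(row[k] + col[k]) / 2 for k in range(q)]
    pe_ac2 = t_w / (q * (q - 1)) * sum(p * (1 - p) for p in pi)
    pe_bp = t_w / q ** 2

    def corrected(p_e):
        return (p_o - p_e) / (1 - p_e)

    return {
        "kappa": corrected(pe_kappa),
        "ac2": corrected(pe_ac2),
        "bp": corrected(pe_bp),
    }


for kind in ("linear", "quadratic"):
    values = agreement_coefficients(TABLE_1, kind)
    print(kind, {name: round(v, 3) for name, v in values.items()})
```

Only the chance term differs between the three coefficients, which is why they share the same observed agreement but rank differently when the marginal distributions are unbalanced.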
Methods
Agreement table and grey zone
Rater II | |||||
---|---|---|---|---|---|
Rater I | 1 | 2 | \(\dots\) | R | Row margin |
1 | \(n_{11}\) | \(n_{12}\) | \(\dots\) | \(n_{1R}\) | \(n_{1.}\) |
2 | \(n_{21}\) | \(n_{22}\) | \(\dots\) | \(n_{2R}\) | \(n_{2.}\) |
\(\vdots\) | \(\vdots\) | \(\vdots\) | \(\ddots\) | \(\vdots\) | \(\vdots\) |
R | \(n_{R1}\) | \(n_{R2}\) | \(\dots\) | \(n_{RR}\) | \(n_{R.}\) |
Column margin | \(n_{.1}\) | \(n_{.2}\) | \(\dots\) | \(n_{.R}\) | \(n\) |
Detection of a grey zone
Rater II | ||||
---|---|---|---|---|
Rater I | 0 | 1 | 2 | 3 |
0 | 36 | 0+4=4 | 0 | 0 |
1 | 7-4=3 | 57+13=70 | 11 | 0 |
2 | 0 | 23-13=10 | 34+14=48 | 4 |
3 | 0 | 1 | 19-14=5 | 10 |
No grey zone (Table 4) | With grey zone (Table 1) | ||||||||
---|---|---|---|---|---|---|---|---|---|
Rater II | Rater II | ||||||||
Rater I | 0 | 1 | 2 | 3 | Rater I | 0 | 1 | 2 | 3 |
0 | 0 | 0.267 | 0 | 0 | 0 | 0 | -1.871 | 0 | 0 |
1 | -0.267 | 0 | 0.154 | -0.707 | 1 | 1.871 | 0 | -1.455 | -0.707 |
2 | 0 | -0.154 | 0 | -0.236 | 2 | 0 | 1.455 | 0 | -2.212 |
3 | 0 | 0.707 | 0.236 | 0 | 3 | 0 | 0.707 | 2.212 | 0 |
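The entries of this comparison can be reproduced from the two agreement tables. The sketch below rests on an assumption inferred from the numbers themselves rather than from a definition given in this text: each off-diagonal entry equals \((n_{ij} - n_{ji})/\sqrt{2(n_{ij} + n_{ji})}\), and is set to 0 when both cells are empty. The function name `standardized_asymmetry` is hypothetical.

```python
import math

# Table 1 (with grey zone) and Table 4 (grey zone removed);
# rows are Rater I, columns are Rater II.
TABLE_1 = [[36, 0, 0, 0], [7, 57, 11, 0], [0, 23, 34, 4], [0, 1, 19, 10]]
TABLE_4 = [[36, 4, 0, 0], [3, 70, 11, 0], [0, 10, 48, 4], [0, 1, 5, 10]]


def standardized_asymmetry(table):
    """Pairwise asymmetry of symmetric cells (formula inferred, see lead-in)."""
    q = len(table)
    out = [[0.0] * q for _ in range(q)]
    for i in range(q):
        for j in range(q):
            total = table[i][j] + table[j][i]
            # Diagonal cells and empty pairs are left at 0.
            if i != j and total > 0:
                out[i][j] = (table[i][j] - table[j][i]) / math.sqrt(2 * total)
    return out


for name, table in (("Table 4", TABLE_4), ("Table 1", TABLE_1)):
    print(name)
    for row in standardized_asymmetry(table):
        print([round(value, 3) for value in row])
```

With the grey zone present, the cells adjacent to the diagonal show the largest absolute values, which is the pattern the comparison is meant to expose.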
Derivation of a threshold for \(\Delta\)
\(\kappa\) | \(\Delta\) | |||||||
---|---|---|---|---|---|---|---|---|
\(\rho\) | Min | Med | Max | Min | Med | 90th | 95th | Max |
0.45 | -0.002 | 0.213 | 0.442 | 4.564 | 16.053 | 246.335 | 413.388 | 413.388 |
0.50 | 0.008 | 0.231 | 0.444 | 4.585 | 13.777 | 376.726 | 376.726 | 376.726 |
0.55 | 0.053 | 0.263 | 0.497 | 3.998 | 10.750 | 158.032 | 158.032 | 249.900 |
0.60 | 0.068 | 0.292 | 0.525 | 3.792 | 9.013 | 43.027 | 104.885 | 181.672 |
0.65 | 0.082 | 0.331 | 0.591 | 3.702 | 7.416 | 16.878 | 16.878 | 146.471 |
0.70 | 0.142 | 0.371 | 0.621 | 2.852 | 6.363 | 10.605 | 14.711 | 46.144 |
0.75 | 0.157 | 0.418 | 0.635 | 2.288 | 5.438 | 6.938 | 6.938 | 29.757 |
0.80 | 0.253 | 0.471 | 0.771 | 1.456 | 4.690 | 5.238 | 5.298 | 12.260 |
0.85 | 0.332 | 0.532 | 0.771 | 1.385 | 3.878 | 4.630 | 4.630 | 8.992 |
0.90 | 0.410 | 0.607 | 0.831 | 1.593 | 3.203 | 4.349 | 4.349 | 7.542 |
Results
Numerical experiments
Data generation
n | \(\boldsymbol\rho\) | \(\boldsymbol\kappa\) | n | \(\boldsymbol\rho\) | \(\boldsymbol\kappa\) |
---|---|---|---|---|---|
50 | 0.960 | 0.639 | 500 | 0.910 | 0.630 |
50 | 0.980 | 0.756 | 500 | 0.960 | 0.754 |
50 | 0.986 | 0.817 | 500 | 0.984 | 0.838 |
100 | 0.930 | 0.639 | 1000 | 0.900 | 0.632 |
100 | 0.965 | 0.744 | 1000 | 0.960 | 0.767 |
100 | 0.985 | 0.835 | 1000 | 0.980 | 0.832 |
250 | 0.925 | 0.633 | | | |
250 | 0.963 | 0.753 | | | |
250 | 0.977 | 0.838 | | | |
Accuracy of \(\Delta\)
R | Case | n | \(\rho\) | True \(\kappa\) | TP | FP | FN | TN | Sens | Spec | MCC |
---|---|---|---|---|---|---|---|---|---|---|---|
3 | GZ at cell (1,2) | 50 | 0.960 | 0.639 | 9126 | 874 | 385 | 9615 | 0.913 | 0.962 | 0.875 |
3 | GZ at cell (1,2) | 50 | 0.980 | 0.756 | 7352 | 2648 | 58 | 9942 | 0.735 | 0.994 | 0.755 |
3 | GZ at cell (1,2) | 50 | 0.986 | 0.817 | 4017 | 5983 | 80 | 9920 | 0.402 | 0.992 | 0.488 |
3 | GZ at cell (1,2) | 100 | 0.930 | 0.639 | 9906 | 94 | 168 | 9832 | 0.991 | 0.983 | 0.974 |
3 | GZ at cell (1,2) | 100 | 0.965 | 0.744 | 8432 | 1568 | 98 | 9902 | 0.843 | 0.990 | 0.843 |
3 | GZ at cell (1,2) | 100 | 0.985 | 0.835 | 9338 | 662 | 749 | 9251 | 0.934 | 0.925 | 0.859 |
3 | GZ at cell (1,2) | 250 | 0.925 | 0.633 | 10000 | 0 | 777 | 9223 | 1.000 | 0.922 | 0.925 |
3 | GZ at cell (1,2) | 250 | 0.963 | 0.753 | 9973 | 27 | 2141 | 7859 | 0.997 | 0.786 | 0.801 |
3 | GZ at cell (1,2) | 250 | 0.977 | 0.838 | 10000 | 0 | 3288 | 6712 | 1.000 | 0.671 | 0.711 |
3 | GZ at cell (1,2) | 500 | 0.910 | 0.630 | 10000 | 0 | 846 | 9154 | 1.000 | 0.915 | 0.919 |
3 | GZ at cell (1,2) | 500 | 0.960 | 0.754 | 10000 | 0 | 733 | 9267 | 1.000 | 0.927 | 0.929 |
3 | GZ at cell (1,2) | 500 | 0.984 | 0.838 | 10000 | 0 | 2225 | 7775 | 1.000 | 0.778 | 0.797 |
3 | GZ at cell (1,2) | 1000 | 0.900 | 0.632 | 10000 | 0 | 1424 | 8576 | 1.000 | 0.858 | 0.866 |
3 | GZ at cell (1,2) | 1000 | 0.960 | 0.767 | 10000 | 0 | 1430 | 8570 | 1.000 | 0.857 | 0.866 |
3 | GZ at cell (1,2) | 1000 | 0.980 | 0.832 | 10000 | 0 | 212 | 9788 | 1.000 | 0.979 | 0.979 |
4 | GZ at cell (1,2) | 50 | 0.911 | 0.624 | 3958 | 6042 | 72 | 9928 | 0.396 | 0.993 | 0.484 |
4 | GZ at cell (1,2) | 50 | 0.969 | 0.731 | 5224 | 4776 | 12 | 9988 | 0.522 | 0.999 | 0.593 |
4 | GZ at cell (1,2) | 50 | 0.982 | 0.839 | 4839 | 5161 | 13 | 9987 | 0.484 | 0.999 | 0.563 |
4 | GZ at cell (1,2) | 100 | 0.935 | 0.612 | 9694 | 306 | 993 | 9007 | 0.969 | 0.901 | 0.872 |
4 | GZ at cell (1,2) | 100 | 0.982 | 0.746 | 9804 | 196 | 727 | 9273 | 0.980 | 0.927 | 0.909 |
4 | GZ at cell (1,2) | 100 | 0.992 | 0.840 | 9959 | 41 | 229 | 9771 | 0.996 | 0.977 | 0.973 |
4 | GZ at cell (1,2) | 250 | 0.945 | 0.616 | 10000 | 0 | 933 | 9067 | 1.000 | 0.907 | 0.911 |
4 | GZ at cell (1,2) | 250 | 0.975 | 0.755 | 10000 | 0 | 623 | 9377 | 1.000 | 0.938 | 0.940 |
4 | GZ at cell (1,2) | 250 | 0.987 | 0.824 | 10000 | 0 | 1065 | 8935 | 1.000 | 0.894 | 0.899 |
4 | GZ at cell (1,2) | 500 | 0.945 | 0.613 | 10000 | 0 | 1381 | 8619 | 1.000 | 0.862 | 0.870 |
4 | GZ at cell (1,2) | 500 | 0.977 | 0.747 | 10000 | 0 | 562 | 9438 | 1.000 | 0.944 | 0.945 |
4 | GZ at cell (1,2) | 500 | 0.987 | 0.824 | 10000 | 0 | 522 | 9478 | 1.000 | 0.948 | 0.949 |
4 | GZ at cell (1,2) | 1000 | 0.940 | 0.617 | 10000 | 0 | 757 | 9243 | 1.000 | 0.924 | 0.927 |
4 | GZ at cell (1,2) | 1000 | 0.975 | 0.740 | 10000 | 0 | 1450 | 8550 | 1.000 | 0.855 | 0.864 |
4 | GZ at cell (1,2) | 1000 | 0.985 | 0.828 | 10000 | 0 | 563 | 9437 | 1.000 | 0.944 | 0.945 |
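The MCC column can be recomputed from the four counts in each row. The sketch below uses the usual definition of the Matthews correlation coefficient; the helper name `mcc` is illustrative, and the zero-denominator fallback is a common convention rather than something stated in the text. The formula is symmetric in FP and FN, so the result does not depend on which of the two middle count columns is assigned to which role.

```python
import math


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    # Convention: return 0 when any marginal total is empty.
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Counts copied from three rows of the accuracy table (R = 3).
rows = [
    (9126, 874, 385, 9615),
    (4017, 5983, 80, 9920),
    (10000, 0, 212, 9788),
]
for counts in rows:
    print(counts, round(mcc(*counts), 3))
```

Because MCC uses all four counts, it drops sharply in the n = 50 rows, where many grey zones go undetected even though the share of correct negative decisions stays high.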
Applications with real data
Assessment of torture allegations
Assessment of PI-RADS v2.1 scores
Radiologist 1 | |||||
---|---|---|---|---|---|
Radiologist 2 | 1 | 2 | 3 | 4 | 5 |
1 | 0 | 4.625 | 3.579 | 1.534 | 0 |
2 | -4.625 | 0 | 0.166 | 3.254 | 0 |
3 | -3.579 | -0.166 | 0 | 3.889 | 0 |
4 | -1.534 | -3.254 | -3.889 | 0 | -1.252 |
5 | 0 | 0 | 0 | 1.252 | 0 |
Agreement coefficient | |||
---|---|---|---|
Weight | Cohen’s kappa | Gwet’s AC2 | Brennan-Prediger’s S |
Linear | 0.651 | 0.793 | 0.747 |
Quadratic | 0.805 | 0.916 | 0.879 |
Discussion
Conclusions
- The proposed framework detects an existing grey zone reliably for sample sizes greater than 50, under all the considered table sizes and true agreement levels.
- The framework's accuracy in correctly determining the absence of a grey zone is very high in all the considered scenarios of sample size, table size, and true agreement level.
- When there is no grey zone in the agreement table, the framework seldom returns a positive result for tables with a sample size of 250 or more, under all the considered table sizes and true agreement levels.
- When there is a grey zone in the table, the rate at which the framework fails to detect it is acceptable.
- The location of a grey zone in the agreement table does not affect the accuracy of the proposed framework.
- The real-data examples demonstrate that, if a grey zone is detected in the agreement table, it is possible to report a higher magnitude of agreement with high confidence. In that sense, if a practitioner suspects a grey zone, as in the first example, using the proposed framework leads to more accurate conclusions.
- Overall, the proposed metric \(\Delta\) and its threshold \(\tau _{\Delta }\) provide researchers with an easy-to-implement, reliable, and accurate way of testing for the existence of a grey zone in an agreement table.