Statistical methods for assessing agreement between two methods of clinical measurement☆
Introduction
CLINICIANS often wish to have data on, for example, cardiac stroke volume or blood pressure where direct measurement without adverse effects is difficult or impossible. The true values remain unknown. Instead indirect methods are used, and a new method has to be evaluated by comparison with an established technique rather than with the true quantity. If the new method agrees sufficiently well with the old, the old may be replaced. This is very different from calibration, where known quantities are measured by a new method and the result compared with the true value or with measurements made by a highly accurate method. When two methods are compared neither provides an unequivocally correct measurement, so we try to assess the degree of agreement. But how?
The correct statistical approach is not obvious. Many studies give the product–moment correlation coefficient (r) between the results of the two measurement methods as an indicator of agreement. It is no such thing. In a statistical journal we have proposed an alternative analysis (Altman and Bland, 1983), and clinical colleagues have suggested that we describe it for a medical readership.
Most of the analysis will be illustrated by a set of data (table) collected to compare two methods of measuring peak expiratory flow rate (PEFR).
Sample data
The sample comprised colleagues and family of J.M.B. chosen to give a wide range of PEFR but in no way representative of any defined population. Two measurements were made with a Wright peak flow meter and two with a mini Wright meter, in random order. All measurements were taken by J.M.B., using the same two instruments. (These data were collected to demonstrate the statistical method and provide no evidence on the comparability of these two instruments.) We did not repeat suspect readings and
Plotting data
The first step is to plot the data and draw the line of equality on which all points would lie if the two meters gave exactly the same reading every time (Fig. 1). This helps the eye in gauging the degree of agreement between measurements, though, as we shall show, another type of plot is more informative.
Inappropriate use of correlation coefficient
The second step is usually to calculate the correlation coefficient (r) between the two methods. For the data in Fig. 1, r = 0.94 (p < 0.001). The null hypothesis here is that the measurements by the two methods are not linearly related. The probability is very small and we can safely conclude that PEFR measurements by the mini and large meters are related. However, this high correlation does not mean that the two methods agree:
(1) r measures the strength of a relation between two variables, not the
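The distinction between correlation and agreement can be illustrated numerically. The sketch below uses invented readings (not the PEFR data in the table) and a hypothetical helper, `pearson_r`: one method always reads twice the other, so r equals 1 even though no pair of readings agrees.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical readings (l/min), invented for illustration only.
method_a = [200.0, 300.0, 400.0, 500.0, 600.0]
method_b = [2 * a for a in method_a]  # method B always reads double method A

r = pearson_r(method_a, method_b)
differences = [b - a for a, b in zip(method_a, method_b)]

print(f"r = {r:.3f}")
print("differences (B - A):", differences)
```

The correlation is perfect, yet every difference is at least 200 l/min, so the two hypothetical methods could not be used interchangeably.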
Measuring agreement
It is most unlikely that different methods will agree exactly, by giving the identical result for all individuals. We want to know by how much the new method is likely to differ from the old; if this is not enough to cause problems in clinical interpretation we can replace the old method by the new or use the two interchangeably. If the two PEFR meters were unlikely to give readings which differed by more than, say, 10 l/min, we could replace the large meter by the mini meter because so small a
Precision of estimated limits of agreement
The limits of agreement are only estimates of the values which apply to the whole population. A second sample would give different limits. We might sometimes wish to use standard errors and confidence intervals to see how precise our estimates are, provided the differences follow a distribution which is approximately Normal. The standard error of the mean difference $\bar{d}$ is $\sqrt{s^2/n}$, where n is the sample size, and the standard error of $\bar{d} - 2s$ and $\bar{d} + 2s$ is about $\sqrt{3s^2/n}$. 95% confidence intervals can be calculated by
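As a minimal sketch of these calculations, the code below computes the mean difference, the standard deviation s of the differences, the limits of agreement (mean difference ± 2s), and the two standard errors, taken as the square root of s²/n for the mean difference and about the square root of 3s²/n for each limit. The function name, the readings, and the dictionary keys are hypothetical. The confidence intervals assume the usual form, estimate ± t × standard error, with t supplied by the caller from tables for n − 1 degrees of freedom.

```python
import math
import statistics

def limits_of_agreement(x, y, t_crit):
    """Limits of agreement (mean difference +/- 2 SD) with approximate standard errors.

    t_crit is the two-sided 95% point of the t distribution with n - 1
    degrees of freedom, supplied by the caller (an assumption of this sketch).
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)   # mean difference (estimated bias)
    s = statistics.stdev(diffs)      # sample SD of the differences (n - 1 denominator)
    lower, upper = d_bar - 2 * s, d_bar + 2 * s
    se_mean = math.sqrt(s ** 2 / n)        # standard error of the mean difference
    se_limit = math.sqrt(3 * s ** 2 / n)   # approximate standard error of each limit
    return {
        "mean_difference": d_bar,
        "sd": s,
        "limits": (lower, upper),
        "se_mean": se_mean,
        "se_limit": se_limit,
        "ci_mean": (d_bar - t_crit * se_mean, d_bar + t_crit * se_mean),
        "ci_lower_limit": (lower - t_crit * se_limit, lower + t_crit * se_limit),
        "ci_upper_limit": (upper - t_crit * se_limit, upper + t_crit * se_limit),
    }

# Hypothetical paired readings, invented for illustration only.
old_method = [10.0, 12.0, 15.0, 11.0, 14.0]
new_method = [9.0, 13.0, 13.0, 12.0, 12.0]

# 2.776 is the tabulated two-sided 95% point of t with 4 degrees of freedom.
result = limits_of_agreement(old_method, new_method, t_crit=2.776)
for name, value in result.items():
    print(name, value)
```

With so few pairs the intervals around the limits are wide, which is the practical reason for checking their precision before relying on them.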
Example showing good agreement
Fig. 3 shows a comparison of oxygen saturation measured by an oxygen saturation monitor and by pulsed oximeter saturation, a new non-invasive technique (Tytler and Seeley, in press). Here the mean difference is 0.42 percentage points with 95% confidence interval 0.13–0.70. Thus pulsed oximeter saturation tends to give a lower reading by between 0.13 and 0.70. Despite this, the limits of agreement (−2.0 and 2.8) are small enough for us to be confident that the new method can be used in place of
Relation between difference and mean
In the preceding analysis it was assumed that the differences did not vary in any systematic way over the range of measurement. This may not be so. Fig. 4 compares the measurement of mean velocity of circumferential fibre shortening (VCF) by the long axis and short axis in M-mode echocardiography (D’Arbela et al., unpublished). The scatter of the differences increases as the VCF increases. We could ignore this but the limits of agreement would be wider apart than necessary for small VCF and
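One rough way to check for this kind of pattern numerically, offered as an illustrative diagnostic and not as the procedure used in the paper, is to compute the mean and difference for each pair (the coordinates of a difference-against-mean plot) and compare the spread of the differences in the lower and upper halves of the range. The function names and the VCF-like values below are hypothetical.

```python
import statistics

def difference_and_mean(x, y):
    """Return (mean, difference) for each pair: the coordinates of a difference-against-mean plot."""
    return [((a + b) / 2, a - b) for a, b in zip(x, y)]

def spread_by_half(points):
    """SD of the differences for the lower and upper halves of the pair means."""
    ordered = sorted(points)  # tuples sort by pair mean first
    half = len(ordered) // 2
    low = [d for _, d in ordered[:half]]
    high = [d for _, d in ordered[half:]]
    return statistics.stdev(low), statistics.stdev(high)

# Hypothetical paired measurements, invented so that scatter grows with magnitude.
long_axis = [0.80, 0.90, 1.00, 1.10, 1.40, 1.60, 1.80, 2.00]
short_axis = [0.82, 0.88, 1.03, 1.08, 1.25, 1.80, 1.55, 2.30]

points = difference_and_mean(long_axis, short_axis)
sd_low, sd_high = spread_by_half(points)
print(f"SD of differences, lower half of means: {sd_low:.3f}")
print(f"SD of differences, upper half of means: {sd_high:.3f}")
```

A markedly larger spread in the upper half suggests that a single pair of limits would not describe agreement equally well across the whole range.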
Repeatability
Repeatability is relevant to the study of method comparison because the repeatabilities of two methods of measurement limit the amount of agreement which is possible. If one method has poor repeatability—i.e., there is considerable variation in repeated measurements on the same subject—the agreement between the two methods is bound to be poor too. When the old method is the more variable one, even a new method which is perfect will not agree with it. If both methods have poor repeatability, the
Measuring agreement using repeated measurements
If we have repeated measurements by each of two methods on the same subjects we can calculate the mean for each method on each subject and use these pairs of means to compare the two methods using the analysis for assessing agreement described above. The estimate of bias will be unaffected, but the estimate of the standard deviation of the differences will be too small, because some of the effect of repeated measurement error has been removed. We can correct for this. Suppose we have two
Discussion
In the analysis of measurement method comparison data neither the correlation coefficient (as we show here) nor techniques such as regression analysis¹ are appropriate. We suggest replacing these misleading analyses by a method that is simple both to do and to interpret. Further, the same method may be used to analyse the repeatability of a single measurement method or to compare measurements by two observers.
Why has a totally inappropriate method, the correlation coefficient, become almost
Conflict of interest
None.
References
- et al. Relationship between initial blood pressure and its fall with treatment. Lancet (1985)
- et al. Measurement in medicine: the analysis of method comparison studies. Statistician (1983)
- Statistical methods in medical research (1971)
- British Standards Institution, 1979. Precision of test methods I. Guide for the determination and reproducibility for a...
☆ This article was originally published in The Lancet 1986; 327(8476): 307–310. The article is republished with permission from The Lancet.