
Correlation and agreement: overview and clarification of competing concepts and measures

Abstract

Agreement and correlation are widely used concepts that assess the association between variables. Although similar and related, they represent completely different notions of association. Assessing agreement between variables assumes that the variables measure the same construct, while correlation of variables can be assessed for variables that measure completely different constructs. This conceptual difference requires the use of different statistical methods, and when assessing agreement or correlation, the statistical method may vary depending on the distribution of the data and the interest of the investigator. For example, the Pearson correlation, a popular measure of correlation between continuous variables, is only informative when applied to variables that have linear relationships; it may be non-informative or even misleading when applied to variables that are not linearly related. Likewise, the intraclass correlation, a popular measure of agreement between continuous variables, may not provide sufficient information for investigators if the nature of poor agreement is of interest. This report reviews the concepts of agreement and correlation and discusses differences in the application of several commonly used measures.

Keywords: concordance correlation, intraclass correlation, Kendall’s tau, non-linear association, Pearson’s correlation, Spearman’s rho

1. Introduction

Agreement and correlation are widely used concepts in the medical literature. Both are used to indicate the strength of association between variables of interest, but they are conceptually distinct and, thus, require the use of different statistics.

Correlation focuses on the association of changes in two outcomes, outcomes that often measure quite different constructs such as cancer and depression. The Pearson correlation is the most popular measure of the association between two continuous outcomes, but it is only useful when measuring linear relationships between variables. If the relationship is non-linear, the Pearson correlation generally does not provide a good indication of association between the variables. Another problem is that using the standard interpretation of Pearson correlation coefficients can, in some circumstances, lead to incorrect conclusions.

Agreement, also known as reproducibility, is a concept closely related to, but fundamentally different from, correlation. Like correlation, agreement also assesses the relationships between outcomes of interest, but, as the name indicates, the emphasis is on the degree of concordance in the opinions between two or more individuals or in the results between two or more assessments of the variable of interest. An example of agreement in mental health research is the consensus between multiple clinicians about the psychiatric diagnoses of a group of patients. In biomedical sciences agreement can also include measures of the reproducibility (i.e., reliability) of a laboratory test result when repeated in the same center or when conducted in multiple centers under the same conditions. It is not sensible to speak of agreement (reproducibility) between variables that measure different constructs; so when measuring the association between different variables – such as weight and height – one can assess correlation but not agreement. For continuous outcomes, the intraclass correlation (ICC) is a popular measure of agreement. Like the Pearson correlation, the ICC is an estimate of the magnitude of the relationship between variables (in this case, between multiple assessments of the same variable). However, the ICC also takes into account rater bias, the element that distinguishes agreement from correlation; that is, good agreement (reproducibility) not only requires good correlation, it also requires small rater bias.

In this report, we provide an overview of popular measures and statistical methods for assessing the two different notions of association between variables. We also clarify the key differences between the measures and between the methods used to assess the measures. We focus on continuous outcomes and assume all variables are continuous unless stated otherwise.

2. Correlation measures

2.1. Pearson correlation

Consider a sample of n subjects and a bivariate continuous outcome, (ui, vi), from each subject within the sample (1 ≤ i ≤ n). The Pearson correlation is the most popular statistic for measuring the association between the two variables ui and vi:[1]

 

$$\hat{p} = \frac{\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{n}(u_i - \bar{u})^2}\,\sqrt{\sum_{i=1}^{n}(v_i - \bar{v})^2}}, \qquad \bar{u} = \frac{1}{n}\sum_{i=1}^{n} u_i, \quad \bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i,$$
(1)

 

where ū (v̄) denotes the sample mean of ui (vi). The Pearson correlation p̂ ranges between -1 and 1, with 1 (-1) indicating perfect positive (negative) correlation and 0 indicating no association between the variables.
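To make the computation concrete, the following minimal Python sketch (our illustration, using simulated data rather than data from the paper) evaluates Equation (1) directly and checks the result against scipy:

```python
import numpy as np
from scipy import stats

# Simulated sample of n = 12 bivariate outcomes (u_i, v_i) with a roughly
# linear relationship (illustrative data only).
rng = np.random.default_rng(0)
u = rng.normal(size=12)
v = 2.0 * u + rng.normal(scale=0.5, size=12)

# Pearson correlation computed directly from Equation (1).
p_hat = np.sum((u - u.mean()) * (v - v.mean())) / np.sqrt(
    np.sum((u - u.mean()) ** 2) * np.sum((v - v.mean()) ** 2)
)

# The same statistic via scipy (which also returns a p-value for H0: p = 0).
p_scipy, _ = stats.pearsonr(u, v)
print(p_hat, p_scipy)  # the two values agree
```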

As popular as it is, the Pearson correlation is only appropriate for measuring the correlation between ui and vi when the two variables follow a linear relationship. If the bivariate outcome (ui, vi) follows a non-linear relationship, p̂ is not an informative measure and is difficult to interpret.

To see this, let μu (μv) and σu² (σv²) denote the (population) mean and variance of the variable ui (vi). The Pearson correlation is an estimate of the following product-moment correlation:

 

$$p = Corr(u_i, v_i) = \frac{Cov(u_i, v_i)}{\sqrt{Var(u_i)\,Var(v_i)}} = \frac{E[(u_i - \mu_u)(v_i - \mu_v)]}{\sqrt{\sigma_u^2 \sigma_v^2}}.$$
(2)

 

Unlike p̂, which measures the correlation between ui and vi based on the sample, the product-moment correlation p is the population-level correlation, which cannot be calculated directly but is estimated by p̂. Thus, p̂ may also be referred to as the ‘sample product-moment correlation’.

If ui and vi have a linear relationship, then ui = avi + b + εi, where a and b are constants and εi denotes a random error with mean 0 and variance σε². By centering ui (vi) at its mean, we have: ui - μu = a(vi - μv) + εi. It follows that σu² = a²σv² + σε². If ui and vi are perfectly correlated, that is, σε² = 0, it follows from Equation (2) that p = 1 (-1), depending on whether a is positive or negative. Also, if ui and vi are uncorrelated, or independent, that is, a = 0, then p = 0, and vice versa.

If ui and vi have a non-linear relationship, the product-moment correlation generally does not provide an informative measure of correlation. The example below shows that the Pearson correlation can be quite misleading in this case.

Example 1. Suppose that ui and vi are perfectly correlated and follow the non-linear relationship ui = vi⁹. Further, assume that vi follows the standard normal distribution N(0, 1) with mean 0 and variance 1. Then, the product-moment correlation is:

 

$$p = \frac{E(v_i^{10}) - E(v_i^{9})E(v_i)}{\sqrt{Var(v_i^{9})\,Var(v_i)}} = \frac{E(v_i^{10})}{\sqrt{E(v_i^{18}) - E^{2}(v_i^{9})}} = \frac{E(v_i^{10})}{\sqrt{E(v_i^{18})}} = 0.161.$$
(3)

 

The poor association between ui and vi indicated by the product-moment correlation contradicts the conceptually perfect correlation between the two variables. Thus, the product-moment correlation and its sample counterpart, the Pearson correlation, generally do not apply to non-linear relationships.
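The value 0.161 in Equation (3) can be checked from the even moments of the standard normal, E(v^(2m)) = (2m - 1)!!, since all odd moments vanish. A short Python verification (ours, for illustration):

```python
import math

# Even moments of the standard normal: E(v^(2m)) = (2m - 1)!!;
# all odd moments are 0.
def double_factorial(k: int) -> int:
    return 1 if k <= 1 else k * double_factorial(k - 2)

E_v10 = double_factorial(9)   # E(v^10) = 9!!  = 945
E_v18 = double_factorial(17)  # E(v^18) = 17!! = 34,459,425

# Product-moment correlation from Equation (3): p = E(v^10) / sqrt(E(v^18)).
p = E_v10 / math.sqrt(E_v18)
print(round(p, 3))  # 0.161
```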

2.2. Spearman’s Rho

Spearman’s rho is another popular measure of association. Unlike the Pearson correlation, it also applies to non-linear relationships, thereby addressing the aforementioned limitation of the Pearson correlation.

Let qi (ri) denote the rankings of ui (vi) (1 ≤ i ≤ n). Spearman’s rho is defined as:

 

$$\hat{\rho} = \frac{\sum_{i=1}^{n}(q_i - \bar{q})(r_i - \bar{r})}{\sqrt{\sum_{i=1}^{n}(q_i - \bar{q})^2}\,\sqrt{\sum_{i=1}^{n}(r_i - \bar{r})^2}}, \qquad \bar{q} = \frac{1}{n}\sum_{i=1}^{n} q_i, \quad \bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i.$$
(4)

 

By comparing (1) and (4), it is clear that ρ̂ is simply the Pearson correlation applied to the rankings (qi, ri) of the original variables (ui, vi). Since the rankings depend only on the ordering of the observations, a perfectly monotone relationship between the original variables, linear or not, translates into a perfectly linear relationship between their rankings. Thus, Spearman’s rho not only has the same interpretation as the Pearson correlation, but also applies to non-linear (monotone) relationships.
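This rank-based construction is easy to verify in code. The sketch below (our illustration; rankdata, pearsonr, and spearmanr are the corresponding scipy functions) computes Spearman’s rho both by ranking and then applying Equation (1), and directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
v = rng.normal(size=20)
u = np.exp(v)            # a monotone but highly non-linear relationship

# Spearman's rho is the Pearson correlation applied to the rankings.
q = stats.rankdata(u)    # rankings q_i of u_i
r = stats.rankdata(v)    # rankings r_i of v_i
rho_by_hand, _ = stats.pearsonr(q, r)

rho_direct, _ = stats.spearmanr(u, v)
print(rho_by_hand, rho_direct)  # both 1.0: perfect monotone association
```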

Spearman’s ρ̂ ranges between -1 and 1, with 1 (-1) indicating perfect positive (negative) correlation; ρ̂ = 0 indicates no association between the variables ui and vi. If ρ̂ = 1, then qi = ri, in which case,

 

(ui < uj and vi < vj) or (ui > uj and vi > vj), for all 1 ≤ i < j ≤ n.
(5)

 

If ρ̂ = -1, then qi = n - ri + 1, in which case,

 

(ui < uj and vi > vj) or (ui > uj and vi < vj), for all 1 ≤ i < j ≤ n.
(6)

 

Any two bivariate outcomes (ui, vi) and (uj, vj) satisfying (5) are said to be concordant: ui and vi are either both smaller or both larger than uj and vj, respectively. Pairs satisfying (6) are said to be discordant. Thus, perfect positive (negative) correlation by Spearman’s rho corresponds to perfect concordance (discordance); that is, the pairs (ui, vi) and (uj, vj) are concordant (discordant) for all 1 ≤ i < j ≤ n.

Example 2. Table 1 shows 12 observations of the bivariate outcome (ui, vi) described in Example 1, along with the ranks of these observations. Note that ui and vi are perfectly (monotonically) related, so their rankings are identical; that is, qi = ri.

Table 1

A sample of 12 bivariate outcomes (ui, vi) simulated with ui = vi⁹ and vi from the standard normal N(0, 1).

vi       0.26   1.49   1.39   0.65   -0.49   -1.38   1.168   0.87   -0.96   2.15    -0.03   -1.08
ui       0      38.1   19.4   0.02   -0.002  -18.5   4.06    0.29   -0.68   971.6   0       -2.10
qi (ri)  6      11     10     7      4       1       9       8      3       12      5       2

In this example the Pearson correlation is p̂ = 0.531, while Spearman’s ρ̂ = 1. Thus, only Spearman’s rho captures the perfect, albeit non-linear, relationship between ui and vi.

Note that the Pearson correlation p̂ = 0.531 differs substantially from the product-moment correlation p = 0.161; this occurs because of the small sample size, n = 12. As the sample size increases, p̂ moves closer to p, a property known as ‘consistency’ in statistics. For example, we also simulated (ui, vi) with n = 1000 and obtained p̂ = 0.173, much closer to p.
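This consistency is easy to see by simulation. The sketch below (our illustration; exact values vary with the random seed) recomputes the Pearson correlation for u = v⁹ at increasing sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# As n grows, the sample Pearson correlation for u = v^9 settles near the
# population value p = 0.161; at n = 12 it fluctuates wildly across seeds.
for n in (12, 1_000, 100_000):
    v = rng.normal(size=n)
    u = v ** 9
    p_hat, _ = stats.pearsonr(u, v)
    print(n, round(p_hat, 3))
```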

Like the Pearson correlation, Spearman’s rho in (4) is a statistic based on a sample. This sample Spearman’s rho is an estimate of the following population Spearman’s rho:

 

$$\rho = 12E[I(u_j < u_i)I(v_k < v_i)] - 3, \quad \text{for all } 1 \le i < j < k \le n.$$
(7)

 

In Equation (7), E[I(uj < ui)I(vk < vi)] denotes the mathematical expectation of I(uj < ui)I(vk < vi), where I(uj < ui) (and similarly I(vk < vi)) is a binary indicator with I(uj < ui) = 1 if uj < ui and 0 otherwise. It can be shown that ρ = 1 (-1) if (ui, vi) are perfectly concordant (discordant), and vice versa.

Note that the sample Spearman’s rho in (4) is simply referred to as Spearman’s rho in the literature. Unlike the Pearson correlation, there is no formal name for the population Spearman’s rho in (7). In general, the lack of a formal name for the population version causes no confusion, since it is usually clear from the context which one is meant. In statistical parlance, the population version of a statistic is called a parameter. The statistic and the parameter serve different purposes. For example, only the parameter can be used to state statistical hypotheses, such as the null hypothesis H0: ρ = 0 for testing whether the population Spearman’s rho is 0. Values of Spearman’s rho reported in studies are always the sample Spearman’s rho.
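Equation (7) can also be checked by Monte Carlo simulation. In the sketch below (our illustration), (ui, vi) are perfectly concordant because ui = vi⁹ is strictly increasing, so the indicator expectation is 1/3 and ρ is approximately 1:

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of Equation (7) for perfectly concordant outcomes
# u = v^9: then I(u_j < u_i) = I(v_j < v_i), E[...] = 1/3, and rho = 1.
n_draws = 200_000
v_i = rng.normal(size=n_draws)
v_j = rng.normal(size=n_draws)  # independent copy, giving u_j = v_j^9
v_k = rng.normal(size=n_draws)  # second independent copy

indicator = (v_j ** 9 < v_i ** 9) & (v_k < v_i)
rho = 12 * indicator.mean() - 3
print(round(rho, 2))  # approximately 1.0
```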

2.3. Kendall’s Tau

Another measure that accommodates non-linear association is Kendall’s tau.[2] Like Spearman’s rho, Kendall’s tau exploits the concepts of concordance and discordance to derive a measure of association for bivariate outcomes. Unlike Spearman’s rho, it uses the numbers of concordant and discordant pairs directly in the definition of the correlation measure.

Specifically, Kendall’s τ (sample version) is defined as:

 

$$\hat{\tau} = \frac{n_c - n_d}{n_t}, \qquad n_t = \frac{1}{2}n(n-1), \quad n_c = \text{number of concordant pairs}, \quad n_d = \text{number of discordant pairs}.$$
(8)

 

In the above, nt = ½n(n - 1) is the total number of pairs in the sample; for continuous outcomes, every pair is either concordant or discordant. If nc = nt (nd = nt), then τ̂ = 1 (-1), and vice versa. Also, if there is no association between ui and vi, then nc and nd should be close to each other and τ̂ should be close to 0 (not exactly 0, because of sampling variability). Thus, like Spearman’s rho, τ̂ = 1 (-1) corresponds to perfect concordance (discordance), and a value of τ̂ close to 0 indicates weak or no association between the variables ui and vi.
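Equation (8) translates directly into code. The sketch below (our illustration) counts concordant and discordant pairs by brute force and compares the result with scipy’s implementation, using perfectly concordant data of the same form as in Example 2:

```python
import numpy as np
from itertools import combinations
from scipy import stats

def kendall_tau(u, v):
    """Sample Kendall's tau from Equation (8): (n_c - n_d) / n_t."""
    n_c = n_d = 0
    for i, j in combinations(range(len(u)), 2):
        s = (u[i] - u[j]) * (v[i] - v[j])
        if s > 0:
            n_c += 1  # concordant pair
        elif s < 0:
            n_d += 1  # discordant pair
    n_t = len(u) * (len(u) - 1) / 2
    return (n_c - n_d) / n_t

rng = np.random.default_rng(4)
v = rng.normal(size=12)
u = v ** 9                            # perfectly concordant bivariate outcomes
tau_scipy, _ = stats.kendalltau(u, v)
print(kendall_tau(u, v), tau_scipy)   # both 1.0
```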

Like the Pearson and Spearman correlations, the sample Kendall’s τ̂ in (8) estimates the following population parameter:

 

$$\tau = 4E[I(u_i < u_j)I(v_i < v_j)] - 1, \quad \text{for all } 1 \le i < j \le n.$$

 

Like its sample counterpart, τ ranges between -1 and 1. If (5) holds true for all pairs (ui, vi) and (uj, vj), then E[I(ui < uj)I(vi < vj)] = ½ and τ = 1. Likewise, if (6) holds true for all pairs, then E[I(ui < uj)I(vi < vj)] = 0 and τ = -1. Thus, τ = 1 (-1) corresponds to perfect concordance (discordance). Finally, if ui and vi are independent, then E[I(ui < uj)I(vi < vj)] = ¼ and τ = 0. Thus, τ = 0 indicates no association between ui and vi, and vice versa.

Example 3. Consider the data in Example 2. The sample Kendall’s tau is τ̂ = 1. Thus, like Spearman’s rho, Kendall’s tau provides a sensible measure of association for non-linearly related variables.

3. Agreement and measures of agreement

Agreement, or reproducibility, is another widely used concept for assessing the relationship among outcomes. As indicated in the Introduction, unlike the variables considered in correlation analysis, variables considered for agreement must measure the same construct. Conversely, the measures of correlation considered in Section 2 generally do not apply to agreement.

Example 4. Consider two judges who each rate the subjects in a study of 5 subjects, sampled from a population of interest, on a scale from 1 to 10. Let ui and vi denote the two judges’ ratings of the ith subject (1 ≤ i ≤ 5). Suppose that the judges’ ratings of the subjects are as follows:

 

(ui, vi): (1, 6), (2, 7), (3, 8), (4, 9), (5, 10).

 

Since ui and vi are linearly related, the Pearson correlation can be applied, yielding p̂ = 1 and indicating perfect correlation. However, the data clearly do not indicate perfect agreement; in fact, the two judges hardly agree with one another.

The poor agreement in this hypothetical example is due to bias in the judges’ ratings. The mean ratings of the two judges are 3 (for ui) and 8 (for vi). Thus, despite the perfect correlation between the ratings, the two judges do not have good agreement because of bias in their ratings of the subjects: either ui is biased downward or vi is biased upward (or both).

The issue of bias does not apply to correlation because the variables considered for correlation generally measure different constructs and, thus, typically have different means. For the Pearson correlation, the sample means ū and v̄ are removed in the calculation of the correlation in (1); thus, the Pearson correlation is unaffected by differences between the (sample) means of the variables being correlated.

3.1. Intraclass correlation

The intraclass correlation (ICC) is a popular measure of agreement for continuous outcomes. Like the Pearson correlation, the ICC requires a linear relationship between the variables. However, it differs from the Pearson correlation in one key respect: the ICC also takes into account differences in the means of the measures being compared. In addition, the ICC can be applied to situations with three or more raters.

Consider a study with n subjects and assume each subject is rated by a different group of K judges. Let yik denote the rating of the ith subject by the kth judge (1 ≤ i ≤ n, 1 ≤ k ≤ K). The ICC is defined based on the following linear mixed-effects model:[3,4,5]

 

$$y_{ik} = \mu + \beta_i + \varepsilon_{ik}, \quad 1 \le k \le K, \; 1 \le i \le n, \qquad \beta_i \sim N(0, \sigma_\beta^2), \quad \varepsilon_{ik} \sim N(0, \sigma^2).$$
(9)

 

In the above model, the fixed effect μ is the (population) mean rating of the study population over all possible judges from the population of judges. The random effect, or latent variable, βi represents the difference between the mean rating of the ith subject and the population mean μ; thus, the sum μ + βi represents the mean rating of the ith subject. The intraclass correlation (ICC) is defined as the variance ratio p_ICC = σβ² / (σβ² + σ²), the ratio of the variance σβ² of the subjects’ mean ratings (μ + βi) to the total variance, which consists of σβ² plus the variance σ² of the judges’ ratings around those means.

If there are only two judges (K = 2), then under the linear mixed-effects model in (9) the product-moment correlation between yi1 and yi2 is the same as the ICC; that is, Corr(yi1, yi2) = σβ² / (σβ² + σ²). Moreover, yi1 and yi2 have the same mean (μ) and variance (σβ² + σ²). Thus, in this special case, the ICC is the same as the product-moment correlation (p_ICC = p). Note that this result does not contradict the data in Example 4: there, ui and vi do not have the same mean, so the linear mixed-effects model in (9) does not apply and the ICC no longer serves its intended purpose. However, since differences in means between judges’ ratings decrease the ICC, this agreement index may still be applied in such situations to indicate poorer agreement. Follow-up analyses are then necessary to determine whether the poor agreement is due to bias between the judges, to large variability, or to both.

Example 5. Consider again Example 4 and let yi1 = ui and yi2 = vi. By fitting the model in (9) to the data, we obtain the estimates σ̂β² = 0 and σ̂² = 9.167. Thus, the (sample) ICC based on the data is p̂_ICC = 0, which is quite different from the Pearson correlation. Although the judges’ ratings are perfectly correlated, agreement between the judges is extremely poor.
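The text obtains these estimates by fitting the mixed-effects model (9); a minimal sketch using the classical one-way ANOVA (moment) estimator of the ICC [5] reproduces the conclusion. The moment estimate is negative here and is truncated at zero, since a ratio of variances cannot be negative, matching σ̂β² = 0 and p̂_ICC = 0 above (the exact variance-component estimates depend on the fitting method):

```python
import numpy as np

# Ratings from Examples 4 and 5: rows are subjects, columns the K = 2 judges.
y = np.array([[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]], dtype=float)
n, K = y.shape

# One-way ANOVA (moment) estimator of the ICC (Shrout & Fleiss's ICC(1, 1)).
subject_means = y.mean(axis=1)
grand_mean = y.mean()
msb = K * np.sum((subject_means - grand_mean) ** 2) / (n - 1)    # between subjects
msw = np.sum((y - subject_means[:, None]) ** 2) / (n * (K - 1))  # within subjects

icc = (msb - msw) / (msb + (K - 1) * msw)
print(round(icc, 2))   # -0.43
print(max(icc, 0.0))   # truncated to 0.0, as in Example 5
```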

Note that p̂_ICC is not a valid measure of agreement between yi1 and yi2 for the data in Example 5, since the assumption of a common mean for yi1 and yi2 is not met by the data. However, it is precisely this assumption that makes p̂_ICC totally different from the Pearson correlation p̂ = 1. We may revise the model in (9) to account for the bias in the judges’ ratings:

 

$$y_{ik} = \mu_k + \beta_i + \varepsilon_{ik}, \quad 1 \le k \le K, \; 1 \le i \le n, \qquad \beta_i \sim N(0, \sigma_\beta^2), \quad \varepsilon_{ik} \sim N(0, \sigma^2),$$
(10)

 

where the added fixed effect μk accounts for the difference in means between the judges. By fitting this model, we obtain the estimates σ̂β² = 1.256, σ̂² = 0, μ̂1 = 3, and μ̂2 = 8 (a difference of 5 between the judges). Once bias is accounted for, the two judges have perfect agreement. The model in (10) also provides the mean rating μ̂k of each judge, and the positive estimate σ̂β² describes the variability among the subjects. Although (10) is the correct model for these data, the variance ratio calculated from it no longer has the interpretation of a measure of agreement; in fact, σ̂β² / (σ̂β² + σ̂²) = 1, the same as the Pearson correlation p̂ = 1 calculated in Example 4.
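The effect of the fixed effects μk is easy to see directly: removing each judge’s mean leaves two identical sets of ratings, which is why σ̂² = 0 and the variance ratio is 1. A minimal sketch:

```python
import numpy as np

# Judges' ratings from Example 4: perfectly correlated but biased.
u = np.array([1, 2, 3, 4, 5], dtype=float)   # judge 1, mean 3
v = np.array([6, 7, 8, 9, 10], dtype=float)  # judge 2, mean 8

# Model (10) absorbs each judge's mean into a fixed effect mu_k. After
# removing those means the residual ratings coincide exactly, so the
# error variance sigma^2 is 0 and the variance ratio equals 1.
print(np.allclose(u - u.mean(), v - v.mean()))  # True
```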

Note that since p_ICC ≥ 0, the ICC cannot capture disagreement (negative association); to assess disagreement, we can either reverse-code some of the judges’ ratings or use a different index, such as the concordance correlation discussed below.

3.2. Concordance correlation

The concordance correlation (CCC) is another measure of agreement. Unlike the ICC, it does not assume a common mean for the judges’ ratings at the outset, so it can be used to assess both the level of agreement and the level of disagreement. However, a major limitation of the CCC is that it applies to only two judges at a time.

Consider a study with n subjects and assume each subject is rated by the same two judges. Let yik again denote the rating of the ith subject by the kth judge (1 ≤ i ≤ n, 1 ≤ k ≤ 2). Let μk = E(yik) and σk² = Var(yik) denote the mean and variance of yik, and let σ12 = Cov(yi1, yi2) denote the covariance between yi1 and yi2. The CCC is defined as:[6]

 

$$p_{CCC} = \frac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2}.$$
(11)

 

Unlike the ICC, no statistical model is assumed in the definition of p_CCC. Further, the two judges can come from two different populations of judges, with different means and variances.

The CCC p_CCC has a convenient decomposition, p_CCC = pCb, where p is the product-moment correlation in (2) and Cb is the bias correction factor given by:

 

$$C_b = \frac{2}{\dfrac{\sigma_1}{\sigma_2} + \dfrac{\sigma_2}{\sigma_1} + \dfrac{(\mu_1 - \mu_2)^2}{\sigma_1 \sigma_2}}.$$
(12)

 

It can be shown that p_CCC = 1 (-1) if and only if p = 1 (-1), μ1 = μ2, and σ1² = σ2².[6] Thus, p_CCC = 1 (-1) if and only if yi1 = yi2 (yi1 = -yi2), that is, when there is perfect agreement (disagreement). The bias correction factor Cb (0 ≤ Cb ≤ 1) in (12) assesses the level of bias, with a smaller Cb indicating larger bias. Thus, unlike the ICC, the CCC reveals whether poor agreement results from low correlation (small p), from large bias (small Cb), or from both.
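Equations (11) and (12) are straightforward to implement from sample moments. The sketch below is ours; it uses n - 1 denominators for the sample variances and covariance, whereas Lin’s original estimator uses n, which changes the value slightly. It returns the CCC along with its decomposition:

```python
import numpy as np

def concordance_correlation(y1, y2):
    """Sample CCC from Equation (11), with Lin's decomposition p_CCC = p * C_b."""
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    m1, m2 = y1.mean(), y2.mean()
    s1, s2 = y1.var(ddof=1), y2.var(ddof=1)          # sample variances
    s12 = np.cov(y1, y2, ddof=1)[0, 1]               # sample covariance
    p_ccc = 2 * s12 / (s1 + s2 + (m1 - m2) ** 2)     # Equation (11)
    p = s12 / np.sqrt(s1 * s2)                       # product-moment correlation
    c_b = 2 / (np.sqrt(s1 / s2) + np.sqrt(s2 / s1)   # Equation (12); sigma_k = sqrt(s_k)
               + (m1 - m2) ** 2 / np.sqrt(s1 * s2))
    return p_ccc, p, c_b
```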

Example 6. Consider again Example 5. The (sample) means and variances of yi1 and yi2, and the (sample) covariance between them, are: μ̂1 = 3, μ̂2 = 8, σ̂1² = 2.5, σ̂2² = 2.5, and σ̂12 = 2.5. Thus, it follows from (11) that p̂_CCC = 2σ̂12 / (σ̂1² + σ̂2² + (μ̂1 - μ̂2)²) = 5/30 ≈ 0.167. We can also obtain p̂_CCC from the decomposition, which in this case yields p̂ = 1, Ĉb = 0.167, and p̂_CCC = p̂Ĉb ≈ 0.167: although the ratings are perfectly correlated, the large bias yields very poor agreement.
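Applying the function above to the Example 4 ratings reproduces these numbers:

```python
u = [1, 2, 3, 4, 5]    # judge 1 ratings from Example 4
v = [6, 7, 8, 9, 10]   # judge 2 ratings

p_ccc, p, c_b = concordance_correlation(u, v)
print(round(p, 3), round(c_b, 3), round(p_ccc, 3))  # 1.0 0.167 0.167
```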

Note that, unlike correlation, the issue of linear versus non-linear association does not arise when assessing agreement, because good agreement itself requires an approximately linear relationship between the outcomes. For example, in the case of two raters, good agreement requires that yi1 and yi2 be close to each other, with yi1 = yi2 in the case of perfect agreement.

4. Discussion

We have discussed the concepts of agreement and correlation and described various measures for assessing the relationships among variables of interest. We focused on measures and methods for continuous outcomes; for non-continuous outcomes, different methods must be applied. For example, for categorical outcomes, a version of Kendall’s tau known as Kendall’s tau-b can be used to assess correlation, and the kappa coefficient can be used to assess agreement.[7]

Biography

 

Ms. Jinyuan Liu obtained her Bachelor of Science degree in statistics from Nanjing University of Posts and Telecommunications in 2015. She is currently a master’s student in the Department of Biostatistics and Computational Biology at the University of Rochester in New York, USA. Her research interests include categorical data analysis, machine learning, and social networks.

Funding Statement

The work was supported in part by a grant (GM108337) from the National Institutes of Health and the National Science Foundation (Tang and Tu) and a pilot grant (UR-CTSI GR500208) from the Clinical and Translational Sciences Institute at the University of Rochester Medical Center (Feng and Tu).

Footnotes

Conflict of interest statement: The authors report no conflict of interest.

 

Contributed by

Authors’ contributions: All authors worked together on this manuscript. In particular, JYL, WT and XMT made major contributions to the section on correlation, GQC, YL and CYF made major contributions to the section on agreement, and JYL and XMT drafted and finalized the manuscript. All authors read and approved the final manuscript.

References

1. Stigler SM. Francis Galton’s Account of the Invention of Correlation. Statist Sci. 1989;4(2):73–79. doi: 10.1214/ss/1177012580.
2. Kowalski J, Tu XM. Modern Applied U-Statistics. New York: Wiley; 2007.
3. Lu N, Chen T, Wu P, Gunzler D, Zhang H, He H, et al. Functional response models for intraclass correlation coefficients. Journal of Applied Statistics. 2014;41:2539–2556. doi: 10.1080/02664763.2014.920780.
4. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1:30–46. doi: 10.1037/1082-989X.1.4.390.
5. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86(2):420–428.
6. Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45(1):255–268.
7. Tang W, He H, Tu XM. Applied Categorical and Count Data Analysis. Boca Raton, FL: Chapman & Hall/CRC; 2012.