
Correlation and agreement: overview and clarification of competing concepts and measures

Abstract

Agreement and correlation are widely used concepts that assess the association between variables. Although similar and related, they represent completely different notions of association. Assessing agreement between variables assumes that the variables measure the same construct, while correlation of variables can be assessed for variables that measure completely different constructs. This conceptual difference requires the use of different statistical methods, and when assessing agreement or correlation, the statistical method may vary depending on the distribution of the data and the interest of the investigator. For example, the Pearson correlation, a popular measure of correlation between continuous variables, is only informative when applied to variables that have linear relationships; it may be non-informative or even misleading when applied to variables that are not linearly related. Likewise, the intraclass correlation, a popular measure of agreement between continuous variables, may not provide sufficient information for investigators if the nature of poor agreement is of interest. This report reviews the concepts of agreement and correlation and discusses differences in the application of several commonly used measures.

Keywords: concordance correlation, intraclass correlation, Kendall’s tau, non-linear association, Pearson’s correlation, Spearman’s rho

1. Introduction

Agreement and correlation are widely used concepts in the medical literature. Both are used to indicate the strength of association between variables of interest, but they are conceptually distinct and, thus, require the use of different statistics.

Correlation focuses on the association of changes in two outcomes, outcomes that often measure quite different constructs such as cancer and depression. The Pearson correlation is the most popular measure of the association between two continuous outcomes, but it is only useful when measuring linear relationships between variables. If the relationship is non-linear, the Pearson correlation generally does not provide a good indication of association between the variables. Another problem is that using the standard interpretation of Pearson correlation coefficients can, in some circumstances, lead to incorrect conclusions.

Agreement, also known as reproducibility, is a concept closely related to, but fundamentally different from, correlation. Like correlation, agreement also assesses the relationships between outcomes of interest, but, as the name indicates, the emphasis is on the degree of concordance in the opinions between two or more individuals or in the results between two or more assessments of the variable of interest. An example of agreement in mental health research is the consensus between multiple clinicians about the psychiatric diagnoses of a group of patients. In biomedical sciences agreement can also include measures of the reproducibility (i.e., reliability) of a laboratory test result when repeated in the same center or when conducted in multiple centers under the same conditions. It is not sensible to speak of agreement (reproducibility) between variables that measure different constructs; so when measuring the association between different variables – such as weight and height – one can assess correlation but not agreement. For continuous outcomes, the intraclass correlation (ICC) is a popular measure of agreement. Like the Pearson correlation, the ICC is an estimate of the magnitude of the relationship between variables (in this case, between multiple assessments of the same variable). However, the ICC also takes into account rater bias, the element that distinguishes agreement from correlation; that is, good agreement (reproducibility) not only requires good correlation, it also requires small rater bias.

In this report, we provide an overview of popular measures and statistical methods for assessing the two different notions of association between variables. We also clarify the key differences between the measures and between the methods used to assess the measures. We focus on continuous outcomes and assume all variables are continuous unless stated otherwise.

2. Correlation measures

2.1. Pearson correlation

Consider a sample of n subjects and a bivariate continuous outcome, (ui, vi), from each subject within the sample (1 ≤ i ≤ n). The Pearson correlation is the most popular statistic for measuring the association between the two variables ui and vi:[1]

 

$$\hat{p} = \frac{\sum_{i=1}^{n}(u_i - \bar{u})(v_i - \bar{v})}{\sqrt{\sum_{i=1}^{n}(u_i - \bar{u})^2}\,\sqrt{\sum_{i=1}^{n}(v_i - \bar{v})^2}}, \qquad \bar{u} = \frac{1}{n}\sum_{i=1}^{n} u_i, \quad \bar{v} = \frac{1}{n}\sum_{i=1}^{n} v_i,$$
(1)

 

where ū (v̄) denotes the sample mean of ui (vi). The Pearson correlation p̂ ranges between -1 and 1, with 1 (-1) indicating perfect positive (negative) correlation and 0 indicating no association between the variables.
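To make the computation concrete, the following minimal Python sketch (our illustration, using simulated data rather than data from the paper) evaluates Equation (1) directly and checks the result against scipy:

```python
import numpy as np
from scipy import stats

# Simulated sample of n = 12 bivariate outcomes (u_i, v_i) with a roughly
# linear relationship (illustrative data only).
rng = np.random.default_rng(0)
u = rng.normal(size=12)
v = 2.0 * u + rng.normal(scale=0.5, size=12)

# Pearson correlation computed directly from Equation (1).
p_hat = np.sum((u - u.mean()) * (v - v.mean())) / np.sqrt(
    np.sum((u - u.mean()) ** 2) * np.sum((v - v.mean()) ** 2)
)

# The same statistic via scipy (which also returns a p-value for H0: p = 0).
p_scipy, _ = stats.pearsonr(u, v)
print(p_hat, p_scipy)  # the two values agree
```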

As popular as it is, the Pearson correlation is only appropriate for measuring the correlation between ui and vi when the two variables follow a linear relationship. If the bivariate outcome (ui, vi) follows a non-linear relationship, p̂ is not an informative measure and is difficult to interpret.

To see this, let μu (μv) and σu² (σv²) denote the (population) mean and variance of the variable ui (vi). The Pearson correlation is an estimate of the following product-moment correlation:

 

$$p = Corr(u_i, v_i) = \frac{Cov(u_i, v_i)}{\sqrt{Var(u_i)\,Var(v_i)}} = \frac{E[(u_i - \mu_u)(v_i - \mu_v)]}{\sqrt{\sigma_u^2 \sigma_v^2}}.$$
(2)

 

Unlike p̂, which measures the correlation between ui and vi based on the sample, the product-moment correlation p is the population-level correlation, which cannot be calculated directly but is estimated by p̂. Thus, p̂ may also be referred to as the ‘sample product-moment correlation’.

If ui and vi have a linear relationship, then ui = avi + b + εi, where a and b are constants and εi denotes a random error with mean 0 and variance σε². By centering ui (vi) at its mean, we have: ui - μu = a(vi - μv) + εi. It follows that σu² = a²σv² + σε². If ui and vi are perfectly correlated, that is, σε² = 0, it follows from Equation (2) that p = 1 (-1), depending on whether a is positive or negative. Also, if ui and vi are uncorrelated, or independent, that is, a = 0, then p = 0, and vice versa.

If ui and vi have a non-linear relationship, the product-moment correlation generally does not provide an informative measure of correlation. The example below shows that the Pearson correlation can be quite misleading in this case.

Example 1. Suppose that ui and vi are perfectly correlated and follow the non-linear relationship ui = vi⁹. Further, assume that vi follows the standard normal distribution N(0, 1) with mean 0 and variance 1. Then, the product-moment correlation is:

 

$$p = \frac{E(v_i^{10}) - E(v_i^{9})E(v_i)}{\sqrt{Var(v_i^{9})\,Var(v_i)}} = \frac{E(v_i^{10})}{\sqrt{E(v_i^{18}) - E^{2}(v_i^{9})}} = \frac{E(v_i^{10})}{\sqrt{E(v_i^{18})}} = 0.161.$$
(3)

 

The poor association between ui and vi indicated by the product-moment correlation contradicts the conceptually perfect correlation between the two variables. Thus, the product-moment correlation and its sample counterpart, the Pearson correlation, generally do not apply to non-linear relationships.
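The value 0.161 in Equation (3) can be checked from the even moments of the standard normal, E(v^(2m)) = (2m - 1)!!, since all odd moments vanish. A short Python verification (ours, for illustration):

```python
import math

# Even moments of the standard normal: E(v^(2m)) = (2m - 1)!!;
# all odd moments are 0.
def double_factorial(k: int) -> int:
    return 1 if k <= 1 else k * double_factorial(k - 2)

E_v10 = double_factorial(9)   # E(v^10) = 9!!  = 945
E_v18 = double_factorial(17)  # E(v^18) = 17!! = 34,459,425

# Product-moment correlation from Equation (3): p = E(v^10) / sqrt(E(v^18)).
p = E_v10 / math.sqrt(E_v18)
print(round(p, 3))  # 0.161
```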

2.2. Spearman’s Rho

Spearman’s rho is another popular measure of association. Unlike the Pearson correlation, it also applies to non-linear relationships, thereby addressing the aforementioned limitation of the Pearson correlation.

Let qi (ri) denote the rankings of ui (vi) (1 ≤ i ≤ n). Spearman’s rho is defined as:

 

$$\hat{\rho} = \frac{\sum_{i=1}^{n}(q_i - \bar{q})(r_i - \bar{r})}{\sqrt{\sum_{i=1}^{n}(q_i - \bar{q})^2}\,\sqrt{\sum_{i=1}^{n}(r_i - \bar{r})^2}}, \qquad \bar{q} = \frac{1}{n}\sum_{i=1}^{n} q_i, \quad \bar{r} = \frac{1}{n}\sum_{i=1}^{n} r_i.$$
(4)

 

By comparing (1) and (4), it is clear that ρ̂ is simply the Pearson correlation applied to the rankings (qi, ri) of the original variables (ui, vi). Since the rankings depend only on the ordering of the observations, a perfectly monotone relationship between the original variables, linear or not, translates into a perfectly linear relationship between their rankings. Thus, Spearman’s rho not only has the same interpretation as the Pearson correlation, but also applies to non-linear (monotone) relationships.
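This rank-based construction is easy to verify in code. The sketch below (our illustration; rankdata, pearsonr, and spearmanr are the corresponding scipy functions) computes Spearman’s rho both by ranking and then applying Equation (1), and directly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
v = rng.normal(size=20)
u = np.exp(v)            # a monotone but highly non-linear relationship

# Spearman's rho is the Pearson correlation applied to the rankings.
q = stats.rankdata(u)    # rankings q_i of u_i
r = stats.rankdata(v)    # rankings r_i of v_i
rho_by_hand, _ = stats.pearsonr(q, r)

rho_direct, _ = stats.spearmanr(u, v)
print(rho_by_hand, rho_direct)  # both 1.0: perfect monotone association
```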

Spearman’s ρ̂ ranges between -1 and 1, with 1 (-1) indicating perfect positive (negative) correlation; ρ̂ = 0 indicates no association between the variables ui and vi. If ρ̂ = 1, then qi = ri, in which case,

 

(ui < uj and vi < vj) or (ui > uj and vi > vj), for all 1 ≤ i < j ≤ n.
(5)

 

If ρ̂ = -1, then qi = n - ri + 1, in which case,

 

(ui < uj and vi > vj) or (ui > uj and vi < vj), for all 1 ≤ i < j ≤ n.
(6)

 

Any two bivariate outcomes (ui, vi) and (uj, vj) satisfying (5) are said to be concordant: ui and vi are either both smaller or both larger than uj and vj, respectively. Pairs satisfying (6) are said to be discordant. Thus, perfect positive (negative) correlation by Spearman’s rho corresponds to perfect concordance (discordance); that is, the pairs (ui, vi) and (uj, vj) are concordant (discordant) for all 1 ≤ i < j ≤ n.

Example 2. Table 1 shows 12 observations of the bivariate outcome (ui, vi) described in Example 1, along with the ranks of these observations. Note that ui and vi are perfectly (monotonically) related, so their rankings are identical; that is, qi = ri.

Table 1

A sample of 12 bivariate outcomes (ui, vi) simulated with ui = vi⁹ and vi from the standard normal N(0, 1).

vi       0.26   1.49   1.39   0.65   -0.49   -1.38   1.168   0.87   -0.96   2.15    -0.03   -1.08
ui       0      38.1   19.4   0.02   -0.002  -18.5   4.06    0.29   -0.68   971.6   0       -2.10
qi (ri)  6      11     10     7      4       1       9       8      3       12      5       2

In this example the Pearson correlation is p̂ = 0.531, while Spearman’s ρ̂ = 1. Thus, only Spearman’s rho captures the perfect, albeit non-linear, relationship between ui and vi.

Note that the Pearson correlation p̂ = 0.531 differs substantially from the product-moment correlation p = 0.161; this occurs because of the small sample size, n = 12. As the sample size increases, p̂ moves closer to p, a property known as ‘consistency’ in statistics. For example, we also simulated (ui, vi) with n = 1000 and obtained p̂ = 0.173, much closer to p.
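This consistency is easy to see by simulation. The sketch below (our illustration; exact values vary with the random seed) recomputes the Pearson correlation for u = v⁹ at increasing sample sizes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# As n grows, the sample Pearson correlation for u = v^9 settles near the
# population value p = 0.161; at n = 12 it fluctuates wildly across seeds.
for n in (12, 1_000, 100_000):
    v = rng.normal(size=n)
    u = v ** 9
    p_hat, _ = stats.pearsonr(u, v)
    print(n, round(p_hat, 3))
```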

Like the Pearson correlation, Spearman’s rho in (4) is a statistic based on a sample. This sample Spearman’s rho is an estimate of the following population Spearman’s rho:

 

$$\rho = 12E[I(u_j < u_i)I(v_k < v_i)] - 3, \quad \text{for all } 1 \le i < j < k \le n.$$
(7)

 

In Equation (7), E[I(uj < ui)I(vk < vi)] denotes the mathematical expectation of I(uj < ui)I(vk < vi), where I(uj < ui) (and similarly I(vk < vi)) is a binary indicator with I(uj < ui) = 1 if uj < ui and 0 otherwise. It can be shown that ρ = 1 (-1) if (ui, vi) are perfectly concordant (discordant), and vice versa.

Note that the sample Spearman’s rho in (4) is simply referred to as Spearman’s rho in the literature. Unlike the Pearson correlation, there is no formal name for the population Spearman’s rho in (7). In general, the lack of a formal name for the population version causes no confusion, since it is usually clear from the context which one is meant. In statistical parlance, the population version of a statistic is called a parameter. The statistic and the parameter serve different purposes. For example, only the parameter can be used to state statistical hypotheses, such as the null hypothesis H0: ρ = 0 for testing whether the population Spearman’s rho is 0. Values of Spearman’s rho reported in studies are always the sample Spearman’s rho.
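Equation (7) can also be checked by Monte Carlo simulation. In the sketch below (our illustration), (ui, vi) are perfectly concordant because ui = vi⁹ is strictly increasing, so the indicator expectation is 1/3 and ρ is approximately 1:

```python
import numpy as np

rng = np.random.default_rng(3)

# Monte Carlo check of Equation (7) for perfectly concordant outcomes
# u = v^9: then I(u_j < u_i) = I(v_j < v_i), E[...] = 1/3, and rho = 1.
n_draws = 200_000
v_i = rng.normal(size=n_draws)
v_j = rng.normal(size=n_draws)  # independent copy, giving u_j = v_j^9
v_k = rng.normal(size=n_draws)  # second independent copy

indicator = (v_j ** 9 < v_i ** 9) & (v_k < v_i)
rho = 12 * indicator.mean() - 3
print(round(rho, 2))  # approximately 1.0
```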

2.3. Kendall’s Tau

Another measure that accommodates non-linear association is Kendall’s tau.[2] Like Spearman’s rho, Kendall’s tau exploits the concepts of concordance and discordance to derive a measure of association for bivariate outcomes. Unlike Spearman’s rho, it uses the numbers of concordant and discordant pairs directly in the definition of the correlation measure.

Specifically, Kendall’s τ (sample version) is defined as:

 

$$\hat{\tau} = \frac{n_c - n_d}{n_t}, \qquad n_t = \frac{1}{2}n(n-1), \quad n_c = \text{number of concordant pairs}, \quad n_d = \text{number of discordant pairs}.$$
(8)

 

In the above, nt = ½n(n - 1) is the total number of pairs in the sample; for continuous outcomes, every pair is either concordant or discordant. If nc = nt (nd = nt), then τ̂ = 1 (-1), and vice versa. Also, if there is no association between ui and vi, then nc and nd should be close to each other and τ̂ should be close to 0 (not exactly 0, because of sampling variability). Thus, like Spearman’s rho, τ̂ = 1 (-1) corresponds to perfect concordance (discordance), and a value of τ̂ close to 0 indicates weak or no association between the variables ui and vi.
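Equation (8) translates directly into code. The sketch below (our illustration) counts concordant and discordant pairs by brute force and compares the result with scipy’s implementation, using perfectly concordant data of the same form as in Example 2:

```python
import numpy as np
from itertools import combinations
from scipy import stats

def kendall_tau(u, v):
    """Sample Kendall's tau from Equation (8): (n_c - n_d) / n_t."""
    n_c = n_d = 0
    for i, j in combinations(range(len(u)), 2):
        s = (u[i] - u[j]) * (v[i] - v[j])
        if s > 0:
            n_c += 1  # concordant pair
        elif s < 0:
            n_d += 1  # discordant pair
    n_t = len(u) * (len(u) - 1) / 2
    return (n_c - n_d) / n_t

rng = np.random.default_rng(4)
v = rng.normal(size=12)
u = v ** 9                            # perfectly concordant bivariate outcomes
tau_scipy, _ = stats.kendalltau(u, v)
print(kendall_tau(u, v), tau_scipy)   # both 1.0
```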

Like the Pearson and Spearman correlations, the sample Kendall’s τ̂ in (8) estimates the following population parameter:

 

$$\tau = 4E[I(u_i < u_j)I(v_i < v_j)] - 1, \quad \text{for all } 1 \le i < j \le n.$$

 

Like its sample counterpart, τ ranges between -1 and 1. If (5) holds true for all pairs (ui, vi) and (uj, vj), then E[I(ui < uj)I(vi < vj)] = ½ and τ = 1. Likewise, if (6) holds true for all pairs, then E[I(ui < uj)I(vi < vj)] = 0 and τ = -1. Thus, τ = 1 (-1) corresponds to perfect concordance (discordance). Finally, if ui and vi are independent, then E[I(ui < uj)I(vi < vj)] = ¼ and τ = 0. Thus, τ = 0 indicates no association between ui and vi, and vice versa.

Example 3. Consider the data in Example 2. The sample Kendall’s tau is τ̂ = 1. Thus, like Spearman’s rho, Kendall’s tau provides a sensible measure of association for non-linearly related variables.

3. Agreement and measures of agreement

Agreement, or reproducibility, is another widely used concept for assessing the relationship among outcomes. As indicated in the Introduction, unlike the variables considered in correlation analysis, variables considered for agreement must measure the same construct. Conversely, the measures of correlation considered in Section 2 generally do not apply to agreement.

Example 4. Consider two judges who each rate the subjects in a study of 5 subjects, sampled from a population of interest, on a scale from 1 to 10. Let ui and vi denote the two judges’ ratings of the ith subject (1 ≤ i ≤ 5). Suppose that the judges’ ratings of the subjects are as follows:

 

(ui, vi): (1, 6), (2, 7), (3, 8), (4, 9), (5, 10).

 

Since ui and vi are linearly related, the Pearson correlation can be applied, yielding p̂ = 1 and indicating perfect correlation. However, the data clearly do not indicate perfect agreement; in fact, the two judges hardly agree with one another.

The poor agreement in this hypothetical example is due to bias in the judges’ ratings. The mean ratings of the two judges are 3 (for ui) and 8 (for vi). Thus, despite the perfect correlation between the ratings, the two judges do not have good agreement because of bias in their ratings of the subjects: either ui is biased downward or vi is biased upward (or both).

The issue of bias does not apply to correlation because the variables considered for correlation generally measure different constructs and, thus, typically have different means. For the Pearson correlation, the sample means ū and v̄ are removed in the calculation of the correlation in (1); thus, the Pearson correlation is unaffected by differences between the (sample) means of the variables being correlated.

3.1. Intraclass correlation

The intraclass correlation (ICC) is a popular measure of agreement for continuous outcomes. Like the Pearson correlation, the ICC requires a linear relationship between the variables. However, it differs from the Pearson correlation in one key respect: the ICC also takes into account differences in the means of the measures being compared. In addition, the ICC can be applied to situations with three or more raters.

Consider a study with n subjects and assume each subject is rated by a different group of K judges. Let yik denote the rating of the ith subject by the kth judge (1 ≤ i ≤ n, 1 ≤ k ≤ K). The ICC is defined based on the following linear mixed-effects model:[3,4,5]

 

$$y_{ik} = \mu + \beta_i + \varepsilon_{ik}, \quad 1 \le k \le K, \; 1 \le i \le n, \qquad \beta_i \sim N(0, \sigma_\beta^2), \quad \varepsilon_{ik} \sim N(0, \sigma^2).$$
(9)

 

In the above model, the fixed effect μ is the (population) mean rating of the study population over all possible judges from the population of judges. The random effect, or latent variable, βi represents the difference between the mean rating of the ith subject and the population mean μ; thus, the sum μ + βi represents the mean rating of the ith subject. The intraclass correlation (ICC) is defined as the variance ratio p_ICC = σβ² / (σβ² + σ²), the ratio of the variance σβ² of the subjects’ mean ratings (μ + βi) to the total variance, which consists of σβ² plus the variance σ² of the judges’ ratings around those means.

If there are only two judges (K = 2), then under the linear mixed-effects model in (9) the product-moment correlation between yi1 and yi2 is the same as the ICC; that is, Corr(yi1, yi2) = σβ² / (σβ² + σ²). Moreover, yi1 and yi2 have the same mean (μ) and variance (σβ² + σ²). Thus, in this special case, the ICC is the same as the product-moment correlation (p_ICC = p). Note that this result does not contradict the data in Example 4: there, ui and vi do not have the same mean, so the linear mixed-effects model in (9) does not apply and the ICC no longer serves its intended purpose. However, since differences in means between judges’ ratings decrease the ICC, this agreement index may still be applied in such situations to indicate poorer agreement. Follow-up analyses are then necessary to determine whether the poor agreement is due to bias between the judges, to large variability, or to both.

Example 5. Consider again Example 4 and let yi1 = ui and yi2 = vi. By fitting the model in (9) to the data, we obtain the estimates σ̂β² = 0 and σ̂² = 9.167. Thus, the (sample) ICC based on the data is p̂_ICC = 0, which is quite different from the Pearson correlation. Although the judges’ ratings are perfectly correlated, agreement between the judges is extremely poor.
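The text obtains these estimates by fitting the mixed-effects model (9); a minimal sketch using the classical one-way ANOVA (moment) estimator of the ICC [5] reproduces the conclusion. The moment estimate is negative here and is truncated at zero, since a ratio of variances cannot be negative, matching σ̂β² = 0 and p̂_ICC = 0 above (the exact variance-component estimates depend on the fitting method):

```python
import numpy as np

# Ratings from Examples 4 and 5: rows are subjects, columns the K = 2 judges.
y = np.array([[1, 6], [2, 7], [3, 8], [4, 9], [5, 10]], dtype=float)
n, K = y.shape

# One-way ANOVA (moment) estimator of the ICC (Shrout & Fleiss's ICC(1, 1)).
subject_means = y.mean(axis=1)
grand_mean = y.mean()
msb = K * np.sum((subject_means - grand_mean) ** 2) / (n - 1)    # between subjects
msw = np.sum((y - subject_means[:, None]) ** 2) / (n * (K - 1))  # within subjects

icc = (msb - msw) / (msb + (K - 1) * msw)
print(round(icc, 2))   # -0.43
print(max(icc, 0.0))   # truncated to 0.0, as in Example 5
```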

Note that p̂_ICC is not a valid measure of agreement between yi1 and yi2 for the data in Example 5, since the assumption of a common mean for yi1 and yi2 is not met by the data. However, it is precisely this assumption that makes p̂_ICC totally different from the Pearson correlation p̂ = 1. We may revise the model in (9) to account for the bias in the judges’ ratings:

 

$$y_{ik} = \mu_k + \beta_i + \varepsilon_{ik}, \quad 1 \le k \le K, \; 1 \le i \le n, \qquad \beta_i \sim N(0, \sigma_\beta^2), \quad \varepsilon_{ik} \sim N(0, \sigma^2),$$
(10)

 

where the added fixed effect μk accounts for the difference in means between the judges. By fitting this model, we obtain the estimates σ̂β² = 1.256, σ̂² = 0, μ̂1 = 3, and μ̂2 = 8 (a difference of 5 between the judges). Once bias is accounted for, the two judges have perfect agreement. The model in (10) also provides the mean rating μ̂k of each judge, and the positive estimate σ̂β² describes the variability among the subjects. Although (10) is the correct model for these data, the variance ratio calculated from it no longer has the interpretation of a measure of agreement; in fact, σ̂β² / (σ̂β² + σ̂²) = 1, the same as the Pearson correlation p̂ = 1 calculated in Example 4.
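The effect of the fixed effects μk is easy to see directly: removing each judge’s mean leaves two identical sets of ratings, which is why σ̂² = 0 and the variance ratio is 1. A minimal sketch:

```python
import numpy as np

# Judges' ratings from Example 4: perfectly correlated but biased.
u = np.array([1, 2, 3, 4, 5], dtype=float)   # judge 1, mean 3
v = np.array([6, 7, 8, 9, 10], dtype=float)  # judge 2, mean 8

# Model (10) absorbs each judge's mean into a fixed effect mu_k. After
# removing those means the residual ratings coincide exactly, so the
# error variance sigma^2 is 0 and the variance ratio equals 1.
print(np.allclose(u - u.mean(), v - v.mean()))  # True
```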

Note that since p_ICC ≥ 0, the ICC cannot capture disagreement (negative association); to assess disagreement, we can either reverse-code some of the judges’ ratings or use a different index, such as the concordance correlation discussed below.

3.2. Concordance correlation

The concordance correlation (CCC) is another measure of agreement. Unlike the ICC, it does not assume a common mean for the judges’ ratings at the outset, so it can be used to assess both the level of agreement and the level of disagreement. However, a major limitation of the CCC is that it applies to only two judges at a time.

Consider a study with n subjects and assume each subject is rated by the same two judges. Let yik again denote the rating of the ith subject by the kth judge (1 ≤ i ≤ n, 1 ≤ k ≤ 2). Let μk = E(yik) and σk² = Var(yik) denote the mean and variance of yik, and let σ12 = Cov(yi1, yi2) denote the covariance between yi1 and yi2. The CCC is defined as:[6]

 

$$p_{CCC} = \frac{2\sigma_{12}}{\sigma_1^2 + \sigma_2^2 + (\mu_1 - \mu_2)^2}.$$
(11)

 

Unlike the ICC, no statistical model is assumed in the definition of p_CCC. Further, the two judges can come from two different populations of judges, with different means and variances.

The CCC p_CCC has a convenient decomposition, p_CCC = pCb, where p is the product-moment correlation in (2) and Cb is the bias correction factor given by:

 

$$C_b = \frac{2}{\dfrac{\sigma_1}{\sigma_2} + \dfrac{\sigma_2}{\sigma_1} + \dfrac{(\mu_1 - \mu_2)^2}{\sigma_1 \sigma_2}}.$$
(12)

 

It can be shown that p_CCC = 1 (-1) if and only if p = 1 (-1), μ1 = μ2, and σ1² = σ2².[6] Thus, p_CCC = 1 (-1) if and only if yi1 = yi2 (yi1 = -yi2), that is, when there is perfect agreement (disagreement). The bias correction factor Cb (0 ≤ Cb ≤ 1) in (12) assesses the level of bias, with a smaller Cb indicating larger bias. Thus, unlike the ICC, the CCC reveals whether poor agreement results from low correlation (small p), from large bias (small Cb), or from both.
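Equations (11) and (12) are straightforward to implement from sample moments. The sketch below is ours; it uses n - 1 denominators for the sample variances and covariance, whereas Lin’s original estimator uses n, which changes the value slightly. It returns the CCC along with its decomposition:

```python
import numpy as np

def concordance_correlation(y1, y2):
    """Sample CCC from Equation (11), with Lin's decomposition p_CCC = p * C_b."""
    y1, y2 = np.asarray(y1, dtype=float), np.asarray(y2, dtype=float)
    m1, m2 = y1.mean(), y2.mean()
    s1, s2 = y1.var(ddof=1), y2.var(ddof=1)          # sample variances
    s12 = np.cov(y1, y2, ddof=1)[0, 1]               # sample covariance
    p_ccc = 2 * s12 / (s1 + s2 + (m1 - m2) ** 2)     # Equation (11)
    p = s12 / np.sqrt(s1 * s2)                       # product-moment correlation
    c_b = 2 / (np.sqrt(s1 / s2) + np.sqrt(s2 / s1)   # Equation (12); sigma_k = sqrt(s_k)
               + (m1 - m2) ** 2 / np.sqrt(s1 * s2))
    return p_ccc, p, c_b
```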

Example 6. Consider again Example 5. The (sample) means and variances of yi1 and yi2, and the (sample) covariance between them, are: μ̂1 = 3, μ̂2 = 8, σ̂1² = 2.5, σ̂2² = 2.5, and σ̂12 = 2.5. Thus, it follows from (11) that p̂_CCC = 2σ̂12 / (σ̂1² + σ̂2² + (μ̂1 - μ̂2)²) = 5/30 ≈ 0.167. We can also obtain p̂_CCC from the decomposition, which in this case yields p̂ = 1, Ĉb = 0.167, and p̂_CCC = p̂Ĉb ≈ 0.167: although the ratings are perfectly correlated, the large bias yields very poor agreement.
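Applying the function above to the Example 4 ratings reproduces these numbers:

```python
u = [1, 2, 3, 4, 5]    # judge 1 ratings from Example 4
v = [6, 7, 8, 9, 10]   # judge 2 ratings

p_ccc, p, c_b = concordance_correlation(u, v)
print(round(p, 3), round(c_b, 3), round(p_ccc, 3))  # 1.0 0.167 0.167
```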

Note that, unlike correlation, the issue of linear versus non-linear association does not arise when assessing agreement, because good agreement itself requires an approximately linear relationship between the outcomes. For example, in the case of two raters, good agreement requires that yi1 and yi2 be close to each other, with yi1 = yi2 in the case of perfect agreement.

4. Discussion

We have discussed the concepts of agreement and correlation and described various measures for assessing the relationships among variables of interest. We focused on measures and methods for continuous outcomes; for non-continuous outcomes, different methods must be applied. For example, for categorical outcomes, a version of Kendall’s tau known as Kendall’s tau-b can be used to assess correlation, and the kappa coefficient can be used to assess agreement.[7]

Biography

 

Ms. Jinyuan Liu obtained her Bachelor of Science degree in statistics from Nanjing University of Posts and Telecommunications in 2015. She is currently a master’s student in the Department of Biostatistics and Computational Biology at the University of Rochester in New York, USA. Her research interests include categorical data analysis, machine learning, and social networks.

Funding Statement

The work was supported in part by a grant (GM108337) from the National Institutes of Health and the National Science Foundation (Tang and Tu) and a pilot grant (UR-CTSI GR500208) from the Clinical and Translational Sciences Institute at the University of Rochester Medical Center (Feng and Tu).

Footnotes

Conflict of interest statement: The authors report no conflict of interest.

 

Contributed by

Authors’ contributions: All authors worked together on this manuscript. In particular, JYL, WT and XMT made major contributions to the section on correlation, GQC, YL and CYF made major contributions to the section on agreement, and JYL and XMT drafted and finalized the manuscript. All authors read and approved the final manuscript.

References

1. Stigler SM. Francis Galton’s Account of the Invention of Correlation. Statist Sci. 1989;4(2):73–79. doi: 10.1214/ss/1177012580.
2. Kowalski J, Tu XM. Modern Applied U-Statistics. New York: Wiley; 2007.
3. Lu N, Chen T, Wu P, Gunzler D, Zhang H, He H, et al. Functional response models for intraclass correlation coefficients. Journal of Applied Statistics. 2014;41:2539–2556. doi: 10.1080/02664763.2014.920780.
4. McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1:30–46. doi: 10.1037/1082-989X.1.4.390.
5. Shrout PE, Fleiss JL. Intraclass correlations: uses in assessing rater reliability. Psychol Bull. 1979;86(2):420–428.
6. Lin LI. A concordance correlation coefficient to evaluate reproducibility. Biometrics. 1989;45(1):255–268.
7. Tang W, He H, Tu XM. Applied Categorical and Count Data Analysis. Boca Raton, FL: Chapman & Hall/CRC; 2012.